Watch, Remember, Reason: Human-View Video Understanding with MLLMs

arXiv · cs.AI · 9d ago

Survey: Human-View Video Understanding with MLLMs — Watch, Remember, Reason Framework

A new arXiv survey paper proposes a unified 'human-view' framework for analyzing multimodal LLM-based video understanding, organized around three functional abilities: watching (perception), remembering (memory), and reasoning. The authors introduce a formulation characterizing video understanding systems by perceptual representations, memory states, reasoning traces, and predictions, then survey methods, datasets, and benchmarks across these dimensions. The work covers challenges including spatio-temporal perception, long-video processing, streaming understanding, and faithful reasoning, with application domains spanning egocentric, sports, medical, and narrative video.