Watch, Remember, Reason: Human-View Video Understanding with MLLMs
watch-remember-reason-human-view-video-understanding-with-mllms-2429353e · 1 event · first seen 9d ago
Aliases: Watch, Remember, Reason: Human-View Video Understanding with MLLMs
Recent events (1)
Survey: Human-View Video Understanding with MLLMs — Watch, Remember, Reason Framework
A new arXiv survey paper proposes a unified 'human-view' framework for analyzing multimodal LLM-based video understanding, organized around three functional abilities: watching (perception), remembering (memory), and reasoning. The authors introduce a formulation characterizing video understanding systems by perceptual representations, memory states, reasoning traces, and predictions, then survey methods, datasets, and benchmarks across these dimensions. The work covers challenges including spatio-temporal perception, long-video processing, streaming understanding, and faithful reasoning, with application domains spanning egocentric, sports, medical, and narrative video.