Entity · paper

Watch, Remember, Reason: Human-View Video Understanding with MLLMs

paper · active · watch-remember-reason-human-view-video-understanding-with-mllms-2429353e · 1 event · first seen Jun 8, 2026

Aliases: Watch, Remember, Reason: Human-View Video Understanding with MLLMs

More like this (12)

Where Does the Answer Come From? Benchmarking View-Level Visual Evidence Identification in Multi-View MLLMs for Autonomous Driving
AIR: Adaptive Interleaved Reasoning with Code in MLLMs
Cognitive Episodes in LLM Reasoning Traces Enable Interpretable Human Item Difficulty Prediction
Reasoning as Pattern Matching: Shared Mechanisms in Human and LLM Everyday Reasoning
Towards Root Memories: Benchmarking and Enhancing Implicit Logical Memory Retrieval for Personalized LLMs
Scaling LLM Reasoning from Minimal Labels: A Semi-Supervised Framework with a Lightweight Verifier
Multimodal Large Language Models
Computational Humor with Multimodal LLMs: Methods, Datasets, Evaluation, and Challenges
Extending LLM Context via Associative Recurrent Memory
StylisticBias: A Few Human Visual Cues Drive Most Social Biases in MLLMs
visual language model
Reasoning Language Models

Recent events (1)

4 · arXiv · cs.AI · Jun 8, 2026

Survey: Human-View Video Understanding with MLLMs — Watch, Remember, Reason Framework

A new arXiv survey paper proposes a unified 'human-view' framework for analyzing multimodal LLM-based video understanding, organized around three functional abilities: watching (perception), remembering (memory), and reasoning. The authors introduce a formulation characterizing video understanding systems by perceptual representations, memory states, reasoning traces, and predictions, then survey methods, datasets, and benchmarks across these dimensions. The work covers challenges including spatio-temporal perception, long-video processing, streaming understanding, and faithful reasoning, with application domains spanning egocentric, sports, medical, and narrative video.

Long Context Evolution · Multimodal Progress · Watch, Remember, Reason: Human-View Video Understanding with MLLMs