6 · arXiv cs.AI (Artificial Intelligence) · 12d ago

MemDreamer: Hierarchical graph memory and agentic retrieval for long video understanding

MemDreamer is a plug-and-play framework for long-video understanding that decouples perception from reasoning by incrementally building a three-tier Hierarchical Graph Memory of spatiotemporal and causal relations. During inference, a reasoning model runs an Observation-Reason-Action loop with agentic, tool-augmented retrieval to navigate the memory graph. This limits the context window to 2% of what full-context ingestion would use while delivering a 12.5-point absolute accuracy gain. The system reaches state-of-the-art results on four benchmarks and narrows the gap with human experts to 3.7 points. The authors also report a strong linear correlation between logical reasoning performance and long-video understanding, and propose agentic capability scaling as a new paradigm for multimodal comprehension.

Long Context Evolution · Agent and Tool Ecosystem · Multimodal Progress · MemDreamer · Hierarchical Graph Memory · Observation-Reason-Action
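
To make the navigation idea concrete, here is a minimal sketch of an observe, reason, act loop descending a tiered memory. It is not the authors' implementation: the tier names (video, scene, clip), the `MemoryNode` class, and the word-overlap score that stands in for the reasoning model are all assumptions made for illustration, and the sketch uses a plain tree, so it omits the causal and cross-node edges the summary describes.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryNode:
    """One node in a hypothetical three-tier memory (video -> scene -> clip)."""
    name: str
    summary: str
    children: list["MemoryNode"] = field(default_factory=list)


def overlap(question: str, text: str) -> int:
    """Toy relevance score: number of shared lowercase words."""
    return len(set(question.lower().split()) & set(text.lower().split()))


def navigate(root: MemoryNode, question: str) -> list[str]:
    """Descend from the root, expanding only the best-matching child per tier."""
    path = [root.name]
    node = root
    while node.children:
        # Observe: read only the summaries of the current node's children.
        # Reason: score each summary against the question (stand-in for a model).
        best = max(node.children, key=lambda child: overlap(question, child.summary))
        # Act: move to the chosen child instead of loading every clip.
        path.append(best.name)
        node = best
    return path


video = MemoryNode("video", "cooking show episode", [
    MemoryNode("scene-1", "host chops onions and garlic", [
        MemoryNode("clip-1a", "host peels garlic"),
        MemoryNode("clip-1b", "host chops onions with a knife"),
    ]),
    MemoryNode("scene-2", "host plates the pasta", [
        MemoryNode("clip-2a", "host drains the pasta"),
        MemoryNode("clip-2b", "host adds basil to the pasta plate"),
    ]),
])

path = navigate(video, "when does the host add basil to the pasta")
print(path)  # ['video', 'scene-2', 'clip-2b']
```

Only the summaries along one root-to-leaf path are read, which is the intuition behind keeping the context window to a small fraction of the full video.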

Related guides (3)

Long Context Evolution · Topic guide

Long Context Evolution: From Bigger Windows to Smarter Memory


Multimodal Progress · Topic guide

Multimodal Progress: How AI Learned to See, Hear, and Act


Agent and Tool Ecosystem · Topic guide

Agent and Tool Ecosystem: How AI Is Learning to Act, Not Just Answer


Related events (8)

4 · arXiv · cs.AI · 12d ago

Survey: Human-View Video Understanding with MLLMs — Watch, Remember, Reason Framework

A new arXiv survey paper proposes a unified 'human-view' framework for analyzing multimodal LLM-based video understanding, organized around three functional abilities: watching (perception), remembering (memory), and reasoning. The authors introduce a formulation characterizing video understanding systems by perceptual representations, memory states, reasoning traces, and predictions, then survey methods, datasets, and benchmarks across these dimensions. The work covers challenges including spatio-temporal perception, long-video processing, streaming understanding, and faithful reasoning, with application domains spanning egocentric, sports, medical, and narrative video.

Long Context Evolution · Multimodal Progress · Watch, Remember, Reason: Human-View Video Understanding with MLLMs

6 · arXiv · cs.CL · 23d ago

VisualMem: Personal Visual Memory Benchmark and Architecture for Personalized AI Agents

This paper introduces a benchmark and hybrid architecture (VisualMem) for personal visual memory in long-term AI agent memory systems. The work addresses a gap in existing text-centric memory systems by capturing both explicit evidence (recurring user-associated entities) and implicit evidence (latent user facts from visual/multimodal cues) from images. VisualMem augments a text-memory backend with a structured personal visual memory module that uses conversational context to resolve identity, ownership, and durable user facts. Experiments show VisualMem substantially outperforms prior memory systems on the new benchmark while remaining competitive on standard text-memory benchmarks.

Long Context Evolution · Evaluation and Benchmarking · VisualMem · long-term memory · Personal Visual Memory Benchmark · +3 more

5 · arXiv · cs.CL · 11d ago

REAL: Reasoning-enhanced temporal graph framework for LLM long-term memory management

REAL is a new framework that represents LLM conversational memory as a temporal, confidence-aware directed property graph, where atomic facts carry validity intervals, confidence scores, and exploration intent labels. It addresses three limitations of prior memory systems: flat text structures, destructive overwrites of evolving facts, and passive retrieval. The system uses non-destructive temporal updates, semantic evaluator-guided hybrid beam search, and counterfactual inference to repair incomplete retrieval states. Experiments show a 22.72% average improvement over flat-text, graph-based, and existing memory baselines.

Long Context Evolution · Agent and Tool Ecosystem · REAL
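
The non-destructive temporal update described above can be illustrated with a small sketch. It assumes integer timestamps and a flat list of facts rather than a directed property graph, and the names `Fact`, `TemporalMemory`, `assert_fact`, and `query` are hypothetical, not REAL's API. The point is only that a changed fact closes the old validity interval instead of overwriting it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Fact:
    """An atomic fact with a confidence score and a validity interval."""
    subject: str
    relation: str
    value: str
    confidence: float
    valid_from: int
    valid_to: Optional[int] = None  # None means the fact is still valid


class TemporalMemory:
    def __init__(self) -> None:
        self.facts: list[Fact] = []

    def assert_fact(self, subject: str, relation: str, value: str,
                    confidence: float, time: int) -> None:
        # Close the interval of any open fact for this key; never delete it.
        for fact in self.facts:
            if (fact.subject, fact.relation) == (subject, relation) and fact.valid_to is None:
                fact.valid_to = time
        self.facts.append(Fact(subject, relation, value, confidence, time))

    def query(self, subject: str, relation: str, time: int) -> Optional[Fact]:
        # Return the fact whose half-open interval [valid_from, valid_to) covers time.
        for fact in self.facts:
            if (fact.subject, fact.relation) != (subject, relation):
                continue
            end = fact.valid_to if fact.valid_to is not None else float("inf")
            if fact.valid_from <= time < end:
                return fact
        return None


memory = TemporalMemory()
memory.assert_fact("user", "lives_in", "Berlin", 0.9, time=1)
memory.assert_fact("user", "lives_in", "Lisbon", 0.8, time=5)

past = memory.query("user", "lives_in", time=3)
present = memory.query("user", "lives_in", time=7)
print(past.value, present.value)  # Berlin Lisbon
```

Because the earlier fact is retained with a closed interval, questions about the past remain answerable after the fact changes.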

6 · arXiv · cs.CL · 2d ago

OmniAgent: POMDP-based active perception agent for long video understanding with test-time scaling

Researchers introduce OmniAgent, a multimodal agent that reformulates long video understanding as a POMDP-based iterative Observation-Thought-Action cycle, selectively distilling audio-visual cues into persistent textual memory rather than processing all frames uniformly. The system uses Agentic Supervised Fine-Tuning and a novel reinforcement learning method (TAURA) with turn-level entropy for credit assignment. OmniAgent demonstrates positive test-time scaling and achieves state-of-the-art open-source results across ten benchmarks, with its 7B model outperforming Qwen2.5-VL-72B on LVBench (50.5% vs. 47.3%).

Inference Economics · Agent and Tool Ecosystem · OmniAgent · Qwen2.5-VL-72B · LVBench · +4 more

6 · arXiv · cs.CL · 25d ago

STORM: Internalized Spatial-Temporal Reasoning for Video-Language Models via Latent Trajectories

STORMS is a two-stage training framework that teaches large vision-language models to perform spatial-temporal video reasoning through bounded continuous latent trajectories rather than explicit textual chain-of-thought, keyframe selection, or external tool use. In Stage I, latent tokens are aligned with thought-video representations derived from generated videos; in Stage II, answer-only supervision internalizes the reasoning process. At inference time, no video regeneration or frame reinsertion is required, reducing latency and engineering complexity. Evaluations on VideoMME, MVBench, TempCompass, and MMVU show improved accuracy with substantially lower inference overhead versus tool-based pipelines.

Inference Economics · Agent and Tool Ecosystem · MVBench · STORMS · TempCompass · +5 more

6 · arXiv · cs.CL · 1mo ago

Mem-π: Adaptive Memory for LLM Agents via On-Demand Generation and Decoupled RL

Mem-π introduces a framework where a dedicated language or vision-language model generates context-specific guidance for LLM agents on demand, rather than retrieving static entries from episodic memory banks. The system is trained with a decision-content decoupled reinforcement learning objective that jointly learns when to generate guidance and what to generate, enabling abstention when generation would not help. Evaluated across web navigation, terminal-based tool use, and text-based embodied interaction benchmarks, Mem-π achieves over 30% relative improvement on web navigation tasks compared to retrieval-based and prior RL-optimized memory baselines.

Evaluation and Benchmarking · Agent and Tool Ecosystem · web navigation benchmark · Mem-π · large language model agents · +3 more

6 · arXiv · cs.CL · 5d ago

GitOfThoughts: Git-based agent memory substrate with sobering findings on memory utility for novel problems

Researchers introduce GitOfThoughts, a system that stores LLM reasoning trees as git repositories, enabling replayable, auditable, and mergeable agent memory at low engineering cost. Across five memory substrates (none, markdown, vector, graph, git), two benchmarks, and two model scales with pre-registered replications, the paper finds that no memory format reliably improves accuracy on novel problems. Memory only helps above a 'copyability threshold' (similarity >~0.8), where retrieved cases are near-duplicates of the current problem — and even then, the gain is answer retrieval rather than method transfer. The paper also documents a retracted result and refuted hypothesis, modeling a rigorous evaluation standard.

Evaluation and Benchmarking · Agent and Tool Ecosystem · GitOfThoughts: Version-Controlled Reasoning and Agent Memory You Can Replay, Diff, and Merge · GitOfThoughts
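
The 'copyability threshold' finding suggests gating memory reuse on similarity. The sketch below shows one way such a gate might look. The summary does not say which similarity measure the paper uses, so cosine similarity over toy vectors is an assumption here, as are the names `cosine`, `retrieve`, and `COPYABILITY_THRESHOLD`; the value 0.8 is the approximate figure from the summary.

```python
import math

COPYABILITY_THRESHOLD = 0.8  # approximate value from the summary above


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec: list[float], memory: list[tuple[list[float], str]]):
    """Return (answer, similarity); answer is None unless the best match clears the threshold."""
    best_answer, best_sim = None, 0.0
    for vec, answer in memory:
        sim = cosine(query_vec, vec)
        if sim > best_sim:
            best_answer, best_sim = answer, sim
    if best_sim > COPYABILITY_THRESHOLD:
        return best_answer, best_sim
    return None, best_sim


# Hypothetical stored cases: (embedding, cached answer).
memory = [([1.0, 0.0, 0.0], "answer A"), ([0.0, 1.0, 0.0], "answer B")]

near_duplicate = retrieve([0.9, 0.1, 0.0], memory)  # close to the first case
novel = retrieve([0.5, 0.5, 0.7], memory)           # unlike any stored case
print(near_duplicate[0], novel[0])  # answer A None
```

A gate like this returns a cached answer only for near-duplicates, which matches the reported observation that the gain is answer retrieval rather than method transfer.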

5 · arXiv · cs.CL · 11d ago

DocTrace: Structure-Aware On-Demand Hypergraph Memory for Long-Document QA

Researchers introduce DocTrace, a multi-agent RAG framework for long-document question answering that uses query-triggered knowledge organization rather than costly query-agnostic preprocessing. The system combines a lightweight document structural tree index, on-demand hypergraph working memory, and a graph-structured experience memory that stores successful reasoning plans for reuse. Evaluated on four long-document QA datasets, DocTrace outperforms the strongest baseline (ComoRAG) by up to 8.85% F1 and 4.40% EM while reducing computational cost by 53.32%.

Long Context Evolution · Agent and Tool Ecosystem · ComoRAG · DocTrace · Trace Only What You Need: Structure-Aware On-Demand Hypergraph Memory for Long-Document Question Answering