5 · arXiv cs.LG (Machine Learning) · 11d ago

Echo-Memory: Controlled study isolates memory mechanisms in action-conditioned world models

Echo-Memory is a controlled benchmark study comparing memory mechanisms in action-conditioned video world models, fixing all other variables (backbone, optimizer, evaluation) to isolate how history storage and retrieval affect scene consistency across camera departures and returns. The study compares raw context, compression-based memory, spatial summaries, and state-space recurrence under a shared video diffusion backbone. Key findings: raw context is a strong baseline for open-domain return; aggressive compression loses salient evidence; and block-wise state-space recurrence is the strongest mechanism for remembering world state across long horizons. The three-branch evaluation protocol reveals that replay fidelity is not a reliable proxy for true world memory.

Evaluation and Benchmarking · Multimodal Progress · Echo-Memory

Related guides (2)

Multimodal Progress · Topic guide

Multimodal Progress: How AI Learned to See, Hear, and Act


Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder


Related events (8)

5 · arXiv · cs.CL · 8d ago

EvoArena benchmark and EvoMem memory paradigm for LLM agents in dynamic environments

Researchers introduce EvoArena, a benchmark suite that evaluates LLM agents in dynamic environments by modeling changes as progressive update sequences across terminal, software, and social domains. Alongside it, they propose EvoMem, a patch-based memory paradigm that records memory evolution as structured update histories to help agents reason about environmental change. Current agents score only 39.6% average accuracy on EvoArena, while EvoMem yields consistent gains on EvoArena and also improves performance on GAIA and LoCoMo benchmarks. The work highlights a significant gap between static-benchmark performance and real-world dynamic deployment requirements.

Evaluation and Benchmarking · Agent and Tool Ecosystem · EvoArena · GAIA · LoCoMo · +1 more

6 · arXiv · cs.AI · 12d ago

MemDreamer: Hierarchical graph memory and agentic retrieval for long video understanding

MemDreamer is a plug-and-play framework that decouples perception and reasoning for long-video understanding by incrementally building a three-tier Hierarchical Graph Memory capturing spatiotemporal and causal relations. During inference, a reasoning model uses an Observation-Reason-Action loop with agentic tool-augmented retrieval to navigate the memory graph, constraining the context window to 2% of full-context ingestion while achieving a 12.5-point absolute accuracy gain. The system reaches SOTA on four benchmarks, narrowing the gap with human experts to 3.7 points. The authors also report a strong linear correlation between logical reasoning performance and long-video understanding, proposing agentic capability scaling as a new paradigm for multimodal comprehension.

Long Context Evolution · Agent and Tool Ecosystem · MemDreamer · Hierarchical Graph Memory · Observation-Reason-Action · +1 more

6 · arXiv · cs.CL · 1mo ago

LongMINT: Benchmark for Evaluating Memory Under Multi-Target Interference in Long-Horizon Agent Systems

LongMINT is a new benchmark designed to evaluate memory-augmented agents in realistic long-horizon settings where information is repeatedly updated and interferes across memories. It contains 15.6k QA pairs over contexts averaging 138.8k tokens (up to 1.8M tokens), spanning domains including state tracking, multi-turn dialogue, Wikipedia revisions, and GitHub commits. Evaluation of 7 representative systems—including vanilla long-context LLMs, RAG, and memory-augmented agent frameworks—reveals consistently low average accuracy of 27.9%, with performance particularly degraded on multi-target aggregation tasks and when earlier facts are revised by subsequent context. The analysis identifies retrieval and memory construction as the primary bottlenecks.

Long Context Evolution · Evaluation and Benchmarking · LongMINT · Retrieval-Augmented Generation · long-context LLMs · +2 more

6 · arXiv · cs.CL · 23d ago

VisualMem: Personal Visual Memory Benchmark and Architecture for Personalized AI Agents

This paper introduces a benchmark and hybrid architecture (VisualMem) for personal visual memory in long-term AI agent memory systems. The work addresses a gap in existing text-centric memory systems by capturing both explicit evidence (recurring user-associated entities) and implicit evidence (latent user facts from visual/multimodal cues) from images. VisualMem augments a text-memory backend with a structured personal visual memory module that uses conversational context to resolve identity, ownership, and durable user facts. Experiments show VisualMem substantially outperforms prior memory systems on the new benchmark while remaining competitive on standard text-memory benchmarks.

Long Context Evolution · Evaluation and Benchmarking · VisualMem · long-term memory · Personal Visual Memory Benchmark · +3 more

5 · arXiv · cs.CL · 12d ago

M³Exam: Benchmark for Multimodal Memory in Realistic User-Agent Interactions

Researchers introduce M³Exam, a query-centric multimodal conversational memory benchmark designed to evaluate language agents on realistic user-agent interactions, including cross-modal grounding and implicit information inference. The authors critique existing benchmarks for assuming sparse visuals and human-human interaction formats. The paper also proposes M³Proctor, a companion memory method that detects query modality bias and retrieves raw visual sources on demand, achieving a 13% accuracy improvement while reducing index-construction time and retrieved tokens by over 70%.

Evaluation and Benchmarking · Agent and Tool Ecosystem · M³Exam · M³Proctor · +1 more

6 · arXiv · cs.CL · 5d ago

GitOfThoughts: Git-based agent memory substrate with sobering findings on memory utility for novel problems

Researchers introduce GitOfThoughts, a system that stores LLM reasoning trees as git repositories, enabling replayable, auditable, and mergeable agent memory at low engineering cost. Across five memory substrates (none, markdown, vector, graph, git), two benchmarks, and two model scales with pre-registered replications, the paper finds that no memory format reliably improves accuracy on novel problems. Memory helps only above a 'copyability threshold' (similarity above roughly 0.8), where retrieved cases are near-duplicates of the current problem, and even then the gain comes from answer retrieval rather than method transfer. The paper also documents a retracted result and a refuted hypothesis, modeling a rigorous evaluation standard.

Evaluation and Benchmarking · Agent and Tool Ecosystem · GitOfThoughts: Version-Controlled Reasoning and Agent Memory You Can Replay, Diff, and Merge · GitOfThoughts

6 · arXiv · cs.LG · 29d ago

Episodic Context and Persistent 3D World Models Enable Curiosity-Driven Exploration in Photorealistic Environments

This paper addresses the failure modes of curiosity-driven RL in complex 3D environments, where agents revisit forgotten states and get trapped in local loops because they lack spatial persistence and episodic memory. The authors combine an online 3D reconstruction, used as a persistent world model, with a sequence-model policy over RGB observations that maintains episodic trajectory context. Trained purely via intrinsic curiosity on HM3D, the agent outperforms RL-based active mapping baselines and generalizes zero-shot to Gibson and AI-generated environments. The approach also enables efficient downstream task adaptation for apple picking and image-goal navigation.

Evaluation and Benchmarking · Agent and Tool Ecosystem · online 3D reconstruction · curiosity-driven reinforcement learning · Remember to be Curious · +3 more

4 · arXiv · cs.CL · 24d ago

ENPMR-Bench: Benchmarking Proactive Memory Retrieval for Emotional Support Agents

This paper introduces ENPMR-Bench, a benchmark for evaluating Emotional Need-aware Proactive Memory Retrieval in memory-augmented language agents deployed for emotional support applications. The benchmark includes over 1,800 memory-augmented dialogues grounded in Maslow's hierarchy of needs, with structured mappings between emotional needs and supportive memory types. Experiments show that both embedding-based and LLM-driven retrieval paradigms fall significantly short of golden memory conditions on empathy scores, and while chain-of-thought prompting helps, a substantial performance gap remains. The work highlights a systematic gap in current agent memory systems when applied to affective rather than purely factual retrieval tasks.

Evaluation and Benchmarking · Agent and Tool Ecosystem · ENPMR-Bench · chain-of-thought prompting · Maslow's Hierarchy of Needs · +1 more