5 · arXiv cs.CL (Computation and Language) · 42h ago

MEMPROBE: Benchmark for auditing long-term agent memory via hidden user-state recovery

MEMPROBE is a new benchmark that evaluates long-term memory in LLM agents by treating memory as an auditable artifact rather than measuring it only through downstream task performance. After a memory-equipped agent assists simulated users across a trajectory of tasks, the benchmark attempts to reconstruct a hidden, taxonomy-anchored user-state bank from the agent's memory store. Testing across 5 memory systems and 50 simulated users with 31 hidden dimensions each, the authors find that task completion and memory recovery are largely independent capabilities — task success nearly saturates even for memoryless baselines, while structured user-state recovery remains moderate (~0.6) and degrades under top-k retrieval constraints.

Evaluation and Benchmarking · Agent and Tool Ecosystem · MemProbe
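
The finding that recovery degrades under top-k retrieval constraints can be pictured with a toy audit. This is a minimal sketch, not the benchmark's scoring method: the hidden state is assumed to be a mapping from dimension name to value, and the function names (`retrieve_top_k`, `recovery_score`), the word-overlap retriever, and the substring match are all hypothetical choices made for illustration.

```python
def retrieve_top_k(memory, query, k):
    """Toy retriever (assumption): rank entries by word overlap with the query."""
    query_words = set(query.lower().split())
    ranked = sorted(
        memory,
        key=lambda entry: len(query_words & set(entry.lower().split())),
        reverse=True,  # sorted() stays stable, so ties keep their original order
    )
    return ranked[:k]


def recovery_score(hidden_state, memory, k):
    """Fraction of hidden dimensions whose value appears in the top-k entries."""
    hits = 0
    for dimension, value in hidden_state.items():
        entries = retrieve_top_k(memory, dimension, k)
        if any(value.lower() in entry.lower() for entry in entries):
            hits += 1
    return hits / len(hidden_state)


# Hypothetical hidden user state and agent memory store.
hidden_state = {"diet": "vegetarian", "home city": "Lisbon", "budget": "low"}
memory = [
    "user diet is vegetarian",
    "user asked about flights",
    "user lives in Lisbon, home city confirmed",
    "budget discussion postponed",
    "budget is low this quarter",
]

for k in (1, 2):
    print(k, round(recovery_score(hidden_state, memory, k), 2))
```

With `k=1` the stale "budget discussion postponed" entry crowds out the entry that holds the budget value, so only two of three dimensions are recovered. Widening to `k=2` recovers all three.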

Related guides (2)

Agent and Tool Ecosystem · Topic guide

Agent and Tool Ecosystem: How AI Is Learning to Act, Not Just Answer

Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Related events (8)

6 · arXiv · cs.CL · 1mo ago

LongMINT: Benchmark for Evaluating Memory Under Multi-Target Interference in Long-Horizon Agent Systems

LongMINT is a new benchmark designed to evaluate memory-augmented agents in realistic long-horizon settings where information is repeatedly updated and interferes across memories. It contains 15.6k QA pairs over contexts averaging 138.8k tokens (up to 1.8M tokens), spanning domains including state tracking, multi-turn dialogue, Wikipedia revisions, and GitHub commits. Evaluation of 7 representative systems—including vanilla long-context LLMs, RAG, and memory-augmented agent frameworks—reveals consistently low average accuracy of 27.9%, with performance particularly degraded on multi-target aggregation tasks and when earlier facts are revised by subsequent context. The analysis identifies retrieval and memory construction as the primary bottlenecks.

Long Context Evolution · Evaluation and Benchmarking · LongMINT · Retrieval-Augmented Generation · long-context LLMs

4 · arXiv · cs.CL · 29d ago

ENPMR-Bench: Benchmarking Proactive Memory Retrieval for Emotional Support Agents

This paper introduces ENPMR-Bench, a benchmark for evaluating Emotional Need-aware Proactive Memory Retrieval in memory-augmented language agents deployed for emotional support applications. The benchmark includes over 1,800 memory-augmented dialogues grounded in Maslow's hierarchy of needs, with structured mappings between emotional needs and supportive memory types. Experiments show that both embedding-based and LLM-driven retrieval paradigms fall significantly short of golden memory conditions on empathy scores, and while chain-of-thought prompting helps, a substantial performance gap remains. The work highlights a systematic gap in current agent memory systems when applied to affective rather than purely factual retrieval tasks.

Evaluation and Benchmarking · Agent and Tool Ecosystem · ENPMR-Bench · chain-of-thought prompting · Maslow's Hierarchy of Needs

6 · arXiv · cs.CL · 42h ago

Systematic evaluation of 12 agent memory systems from a data management perspective

A new arXiv preprint proposes an analytical framework decomposing agent memory into four core modules—representation/storage, extraction, retrieval/routing, and maintenance—and evaluates 12 representative memory systems across five benchmark workloads spanning 11 datasets. The study finds no single architecture dominates across scenarios; effectiveness depends on alignment between memory structure and workload bottleneck. Fine-grained ablation studies quantify effects on retrieval precision, update correctness, and long-horizon stability, and reveal that localized maintenance is more cost-efficient than global reorganization. Code is publicly released.

Long Context Evolution · Evaluation and Benchmarking · OpenDataBox · Are We Ready For An Agent-Native Memory System?

6 · arXiv · cs.CL · 2d ago

TriggerBench: A benchmark for evaluating prospective memory in LLMs

Researchers introduce TriggerBench, a benchmark evaluating prospective memory (PM) in LLMs — the ability to spontaneously recall and act on latent constraints without explicit prompting. The benchmark spans five dimensions across daily assistant and professional workflow scenarios, and reveals that PM is substantially harder than retrospective memory, decaying sharply with context length while retrospective memory near-saturates at 100K tokens. Key findings include a precision-recall trade-off in PM, attentional fragility under concurrent requests, and a novel result that PM accuracy correlates with spare reasoning capacity as measured against AIME-2025 math performance.

Long Context Evolution · Evaluation and Benchmarking · TriggerBench · AIME 2025

5 · arXiv · cs.CL · 17d ago

M³Exam: Benchmark for Multimodal Memory in Realistic User-Agent Interactions

Researchers introduce M³Exam, a query-centric multimodal conversational memory benchmark designed to evaluate language agents on realistic user-agent interactions, including cross-modal grounding and implicit information inference. The paper critiques existing benchmarks for assuming sparse visuals and human-human interaction formats. It also proposes M³Proctor, a companion memory method that detects query modality bias and retrieves raw visual sources on demand, achieving a 13% accuracy improvement while reducing index-construction time and retrieved tokens by over 70%.

Evaluation and Benchmarking · Agent and Tool Ecosystem · M³Exam · M³Proctor

5 · arXiv · cs.CL · 13d ago

EvoArena benchmark and EvoMem memory paradigm for LLM agents in dynamic environments

Researchers introduce EvoArena, a benchmark suite that evaluates LLM agents in dynamic environments by modeling changes as progressive update sequences across terminal, software, and social domains. Alongside it, they propose EvoMem, a patch-based memory paradigm that records memory evolution as structured update histories to help agents reason about environmental change. Current agents score only 39.6% average accuracy on EvoArena, while EvoMem yields consistent gains on EvoArena and also improves performance on GAIA and LoCoMo benchmarks. The work highlights a significant gap between static-benchmark performance and real-world dynamic deployment requirements.

Evaluation and Benchmarking · Agent and Tool Ecosystem · EvoArena · GAIA · LoCoMo

6 · arXiv · cs.CL · 28d ago

MemTrace: Framework for Tracing and Attributing Errors in LLM Memory Systems

MemTrace introduces a framework that converts LLM memory pipelines into executable memory evolution graphs to enable fine-grained error tracing and root-cause attribution. The authors construct MemTraceBench, a benchmark covering Long-Context, RAG, Mem0, and EverMemOS memory systems, to systematically characterize memory failure modes such as information loss and retrieval misalignment. An automatic attribution method iteratively traces operation subgraphs to pinpoint failures, and the resulting signals are used to guide prompt optimization in a closed-loop system that improves end-task performance by up to 7.62%.

Long Context Evolution · Evaluation and Benchmarking · Mem0 · memory evolution graph · MemTrace

6 · arXiv · cs.CL · 10d ago

GitOfThoughts: Git-based agent memory substrate with sobering findings on memory utility for novel problems

Researchers introduce GitOfThoughts, a system that stores LLM reasoning trees as git repositories, enabling replayable, auditable, and mergeable agent memory at low engineering cost. Across five memory substrates (none, markdown, vector, graph, git), two benchmarks, and two model scales with pre-registered replications, the paper finds that no memory format reliably improves accuracy on novel problems. Memory only helps above a 'copyability threshold' (similarity above roughly 0.8), where retrieved cases are near-duplicates of the current problem, and even then the gain is answer retrieval rather than method transfer. The paper also documents a retracted result and a refuted hypothesis, modeling a rigorous evaluation standard.

Evaluation and Benchmarking · Agent and Tool Ecosystem · GitOfThoughts: Version-Controlled Reasoning and Agent Memory You Can Replay, Diff, and Merge · GitOfThoughts
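
The 'copyability threshold' in the GitOfThoughts summary can be illustrated as a simple gate on retrieved cases. This is a hedged sketch only: the summary does not say which similarity measure the paper uses, so cosine similarity over made-up 2-D vectors is an assumption, and `usable_memories` and the case names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def usable_memories(query_vec, cases, threshold=0.8):
    """Keep only the cases whose similarity to the query exceeds the threshold."""
    return [name for name, vec in cases if cosine(query_vec, vec) > threshold]


# Hypothetical embeddings for stored cases and the current problem.
cases = [
    ("near_duplicate", [1.0, 0.1]),
    ("loosely_related", [0.5, 1.0]),
    ("unrelated", [0.0, 1.0]),
]
query = [1.0, 0.0]

print(usable_memories(query, cases))
```

Only the near-duplicate case clears the gate here, which mirrors the reported pattern that memory helps when a stored case almost matches the current problem.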