6 · arXiv cs.CL (Computation and Language) · 23d ago

MemTrace: Framework for Tracing and Attributing Errors in LLM Memory Systems

MemTrace introduces a framework that converts LLM memory pipelines into executable memory evolution graphs to enable fine-grained error tracing and root-cause attribution. The authors construct MemTraceBench, a benchmark covering Long-Context, RAG, Mem0, and EverMemOS memory systems, to systematically characterize memory failure modes such as information loss and retrieval misalignment. An automatic attribution method iteratively traces operation subgraphs to pinpoint failures, and the resulting signals are used to guide prompt optimization in a closed-loop system that improves end-task performance by up to 7.62%.

Long Context Evolution · Evaluation and Benchmarking · Agent and Tool Ecosystem · Mem0 · memory evolution graph · MemTrace · MemTraceBench · EverMemOS · Retrieval-Augmented Generation · Zhejiang University NLP Group (ZJUNLP)
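The tracing idea in the MemTrace summary can be illustrated with a minimal sketch. It assumes a memory pipeline recorded as a graph of operations, each storing its output text and the names of its upstream operations. The names `OpNode` and `trace_loss`, the example pipeline, and the substring check for "the fact is still present" are hypothetical stand-ins and not the authors' method or API.

```python
from dataclasses import dataclass, field


@dataclass
class OpNode:
    name: str                                     # hypothetical operation label
    output: str                                   # text produced by this operation
    parents: list = field(default_factory=list)   # names of upstream operations


def trace_loss(graph, node_name, fact):
    """Walk upstream from node_name and return the first operation whose
    output lacks `fact` although one of its inputs still carried it.
    Returns None if the fact is present at node_name."""
    node = graph[node_name]
    if fact in node.output:
        return None                # nothing was lost at this point
    if not node.parents:
        return node.name           # the fact never entered the pipeline here
    if any(fact in graph[p].output for p in node.parents):
        return node.name           # input had the fact, output dropped it
    for p in node.parents:         # otherwise the loss happened further upstream
        culprit = trace_loss(graph, p, fact)
        if culprit is not None:
            return culprit
    return None


# Toy pipeline (assumed for illustration): ingest -> summarize -> retrieve -> answer
graph = {
    "ingest": OpNode("ingest", "Alice moved to Paris in 2021."),
    "summarize": OpNode("summarize", "Alice moved abroad.", ["ingest"]),
    "retrieve": OpNode("retrieve", "Alice moved abroad.", ["summarize"]),
    "answer": OpNode("answer", "Unknown city.", ["retrieve"]),
}

culprit = trace_loss(graph, "answer", "Paris")
print(culprit)
```

In this toy pipeline the walk stops at the summarization step, since its input still mentions the city and its output does not. A real attribution method would need a semantic check rather than substring matching.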

Related guides (3)

Long Context Evolution · Topic guide

Long Context Evolution: From Bigger Windows to Smarter Memory


Agent and Tool Ecosystem · Topic guide

Agent and Tool Ecosystem: How AI Is Learning to Act, Not Just Answer


Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder


Related events (8)

4 · GitHub Trending · 29d ago

MemOS: Self-Evolving Memory OS for LLM Agents with Hybrid Retrieval and Token Savings

MemOS is an open-source TypeScript project providing a memory operating system layer for LLM and AI agents, featuring ultra-persistent memory, hybrid retrieval, and cross-task skill reuse. The project claims 35.24% token savings through its memory management approach. It has accumulated 9,329 GitHub stars with moderate daily momentum (+67). The system targets agent memory persistence and efficiency as a foundational infrastructure component.

Inference Economics · Agent and Tool Ecosystem · MemOS · MemTensor

6 · Mistral AI News · 1mo ago

Mistral AI Engineering Deep Dive: Debugging a Memory Leak in vLLM

Mistral AI's engineering team investigated a memory leak in vLLM that appeared exclusively during disaggregated prefill/decode serving with Mistral Medium 3.1 and graph compilation enabled, causing ~400 MB/min RSS growth. The leak was not visible in heap profilers (Memray, Guppy3, Heaptrack), pointing to off-heap memory allocation tied to NIXL/UCX-based KV cache transfer over InfiniBand. The post is the first in a new Engineering Deep Dive series and documents a methodical descent from Python-level tools to kernel-level tracing to isolate the root cause.

Training Infrastructure · Inference Economics · Mistral AI · Prefill/Decode Disaggregation · Mistral-medium · +7 more

6 · arXiv cs.AI · 23d ago

FluxMem: Connectivity-Evolving Memory Framework for LLM Agents

FluxMem proposes a heterogeneous graph-based memory framework for LLM agents that continuously evolves its topology through three stages: initial connection formation, feedback-driven refinement, and long-term consolidation. Unlike static memory repositories, FluxMem repairs missing links, prunes interference, aligns abstraction granularity, and distills successful trajectories into reusable procedural circuits. The system is guided by a single metric for memory generalizability and evolutionary maturity, achieving state-of-the-art results on LoCoMo, Mind2Web, and GAIA benchmarks.

Long Context Evolution · Evaluation and Benchmarking · heterogeneous graph memory · LightMem · GAIA · +6 more

6 · arXiv cs.CL · 1mo ago

LongMINT: Benchmark for Evaluating Memory Under Multi-Target Interference in Long-Horizon Agent Systems

LongMINT is a new benchmark designed to evaluate memory-augmented agents in realistic long-horizon settings where information is repeatedly updated and interferes across memories. It contains 15.6k QA pairs over contexts averaging 138.8k tokens (up to 1.8M tokens), spanning domains including state tracking, multi-turn dialogue, Wikipedia revisions, and GitHub commits. Evaluation of 7 representative systems—including vanilla long-context LLMs, RAG, and memory-augmented agent frameworks—reveals consistently low average accuracy of 27.9%, with performance particularly degraded on multi-target aggregation tasks and when earlier facts are revised by subsequent context. The analysis identifies retrieval and memory construction as the primary bottlenecks.

Long Context Evolution · Evaluation and Benchmarking · LongMINT · Retrieval-Augmented Generation · long-context LLMs · +2 more

5 · arXiv cs.CL · 11d ago

Infini Memory: Topic-structured persistent memory architecture for long-term LLM agents

Researchers propose Infini Memory, a persistent memory architecture for LLM agents that organizes memory as topic-structured documents rather than isolated records or summaries. New observations are staged in a buffer and periodically consolidated, while retrieval uses iterative agentic tool calls instead of a single lookup step. The system achieves 64.7% on MemoryAgentBench, with ablations showing complementary gains from topic-structured maintenance and iterative evidence inspection.

Evaluation and Benchmarking · Agent and Tool Ecosystem · Infini Memory · MemoryAgentBench
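The stage-then-consolidate pattern in the Infini Memory summary can be sketched as follows. This is a minimal illustration only: the class name `TopicMemory`, the `flush_every` threshold, and topic labels supplied by the caller are all assumptions, and the iterative tool-call retrieval mentioned in the summary is not modeled.

```python
from collections import defaultdict


class TopicMemory:
    """Hypothetical sketch: observations wait in a buffer and are merged
    into per-topic documents once the buffer reaches a threshold."""

    def __init__(self, flush_every=3):
        self.buffer = []                # staged (topic, text) observations
        self.docs = defaultdict(list)   # topic -> consolidated lines
        self.flush_every = flush_every  # assumed consolidation trigger

    def observe(self, topic, text):
        self.buffer.append((topic, text))
        if len(self.buffer) >= self.flush_every:
            self.consolidate()

    def consolidate(self):
        # Merge staged observations into their topic document, skipping
        # exact duplicates, then empty the buffer.
        for topic, text in self.buffer:
            if text not in self.docs[topic]:
                self.docs[topic].append(text)
        self.buffer.clear()


mem = TopicMemory(flush_every=2)
mem.observe("travel", "Flight booked for May.")
staged_before_flush = len(mem.buffer)   # one observation still staged
mem.observe("travel", "Hotel is near the station.")
print(dict(mem.docs))
```

Keeping each topic as one growing document, instead of a set of isolated records, means a later reader sees related observations together. The count-based trigger here is only one possible policy.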

6 · arXiv cs.CL · 1mo ago

Mem-π: Adaptive Memory for LLM Agents via On-Demand Generation and Decoupled RL

Mem-π introduces a framework where a dedicated language or vision-language model generates context-specific guidance for LLM agents on demand, rather than retrieving static entries from episodic memory banks. The system is trained with a decision-content decoupled reinforcement learning objective that jointly learns when to generate guidance and what to generate, enabling abstention when generation would not help. Evaluated across web navigation, terminal-based tool use, and text-based embodied interaction benchmarks, Mem-π achieves over 30% relative improvement on web navigation tasks compared to retrieval-based and prior RL-optimized memory baselines.

Evaluation and Benchmarking · Agent and Tool Ecosystem · web navigation benchmark · Mem-π · large language model agents · +3 more

5 · arXiv cs.CL · 8d ago

EvoArena benchmark and EvoMem memory paradigm for LLM agents in dynamic environments

Researchers introduce EvoArena, a benchmark suite that evaluates LLM agents in dynamic environments by modeling changes as progressive update sequences across terminal, software, and social domains. Alongside it, they propose EvoMem, a patch-based memory paradigm that records memory evolution as structured update histories to help agents reason about environmental change. Current agents score only 39.6% average accuracy on EvoArena, while EvoMem yields consistent gains on EvoArena and also improves performance on GAIA and LoCoMo benchmarks. The work highlights a significant gap between static-benchmark performance and real-world dynamic deployment requirements.

Evaluation and Benchmarking · Agent and Tool Ecosystem · EvoArena · GAIA · LoCoMo · +1 more
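One way to picture a patch-based memory like the one in the EvoMem summary is a store that logs every change as a patch alongside the current state. The sketch below is an assumption-laden illustration: `Patch`, `PatchMemory`, the field names, and the example key are hypothetical, and patches are assumed to arrive in step order.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Patch:
    step: int            # when the change was observed
    key: str             # which fact changed
    old: Optional[str]   # previous value, None if the key is new
    new: str             # value after the change


class PatchMemory:
    """Hypothetical sketch: current state plus an ordered update history."""

    def __init__(self):
        self.state = {}
        self.history = []

    def update(self, step, key, value):
        self.history.append(Patch(step, key, self.state.get(key), value))
        self.state[key] = value

    def value_at(self, key, step):
        # Replay patches up to `step` to recover the value at that time.
        value = None
        for patch in self.history:
            if patch.key == key and patch.step <= step:
                value = patch.new
        return value


mem = PatchMemory()
mem.update(1, "python_version", "3.10")
mem.update(5, "python_version", "3.12")
print(mem.value_at("python_version", 3), mem.state["python_version"])
```

Because the history keeps the old and new value of each change, an agent can answer both "what is true now" and "what was true earlier", which a store that only overwrites values cannot do.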

5 · arXiv cs.CL · 15d ago

PropMe framework distinguishes memorization capability from propensity in LLMs

A new arXiv preprint introduces PropMe, a framework that separates whether LLMs can be forced to reproduce training data (capability) from whether they do so under ordinary use (propensity). The authors also release SimpleTrace, a lightweight pipeline using infini-gram to attribute model outputs to training corpora. Evaluating two open models on Common Pile and Dynaword, they find a consistent gap: adversarial prefix attacks elicit strong memorization, but propensity scores remain low in non-adversarial settings. The paper argues memorization audits should report both worst-case extractability and ordinary leakage propensity.

Evaluation and Benchmarking · AI Safety Research · PropMe · SimpleTrace · Dynaword · +4 more