5 · arXiv · cs.CL (Computation and Language) · 3h ago

IMLogic benchmark and RootMem framework target implicit logical memory retrieval for personalized LLMs

Researchers introduce IMLogic, a benchmark for evaluating implicit logical memory retrieval in long-dialogue personalized LLM scenarios, addressing gaps in existing semantic-similarity-based retrieval methods. They also propose RootMem, a plug-and-play framework that distills user histories into structured 'root memories' and uses an LLM-based router to activate logically relevant memories alongside semantic retrieval. Experiments show RootMem outperforms retrieval baselines and improves existing memory agents. The work targets a concrete weakness in current personalized LLM memory systems where logically critical memories lack semantic overlap with queries.
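
The failure mode described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration, not the paper's implementation: the `RootMemory` structure, its `triggers` field, the word-overlap scorer standing in for embedding search, and the keyword rule standing in for the LLM-based router are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RootMemory:
    """A distilled, durable user fact (hypothetical structure, not from the paper)."""
    key: str
    text: str
    triggers: frozenset  # query topics where this fact matters logically


def tokens(text):
    """Lowercased words with trailing punctuation stripped."""
    return {w.strip(".,?!").lower() for w in text.split()} - {""}


def semantic_retrieve(query, history, k=1):
    """Stand-in for embedding search: rank turns by Jaccard word overlap."""
    q = tokens(query)

    def score(turn):
        t = tokens(turn)
        return len(q & t) / len(q | t) if q | t else 0.0

    ranked = sorted(history, key=score, reverse=True)
    return [turn for turn in ranked[:k] if score(turn) > 0]


def route(query, roots):
    """Stand-in for the LLM router: activate roots whose triggers occur in the query."""
    q = tokens(query)
    return [r for r in roots if q & r.triggers]


def build_context(query, history, roots, k=1):
    """Semantic hits first, then routed root memories, without duplicates."""
    context = semantic_retrieve(query, history, k)
    for root in route(query, roots):
        if root.text not in context:
            context.append(root.text)
    return context


history = [
    "I had a bad reaction to peanuts at my cousin's wedding.",
    "Can you suggest a restaurant for dinner near the office?",
]
roots = [
    RootMemory("allergy", "User is allergic to peanuts.",
               frozenset({"restaurant", "recipe", "dinner", "snack"})),
]
query = "Recommend a Thai restaurant for Friday."

print(semantic_retrieve(query, history))     # similarity alone returns the restaurant turn
print(build_context(query, history, roots))  # the router adds the allergy root memory
```

In this toy setup the allergy turn shares almost no words with the query, so similarity ranking skips it, while the routed root memory still reaches the context. A real router would be an LLM call rather than a keyword match.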

Evaluation and Benchmarking · Agent and Tool Ecosystem · Towards Root Memories: Benchmarking and Enhancing Implicit Logical Memory Retrieval for Personalized LLMs · IMLogic · RootMem

Related guides (2)

Agent and Tool Ecosystem · Topic guide

Agent and Tool Ecosystem: How AI Is Learning to Act, Not Just Answer

Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Related events (8)

6 · arXiv · cs.CL · 1mo ago

Mem-π: Adaptive Memory for LLM Agents via On-Demand Generation and Decoupled RL

Mem-π introduces a framework where a dedicated language or vision-language model generates context-specific guidance for LLM agents on demand, rather than retrieving static entries from episodic memory banks. The system is trained with a decision-content decoupled reinforcement learning objective that jointly learns when to generate guidance and what to generate, enabling abstention when generation would not help. Evaluated across web navigation, terminal-based tool use, and text-based embodied interaction benchmarks, Mem-π achieves over 30% relative improvement on web navigation tasks compared to retrieval-based and prior RL-optimized memory baselines.

Evaluation and Benchmarking · Agent and Tool Ecosystem · web navigation benchmark · Mem-π · large language model agents

4 · GitHub Trending · 1mo ago

MemOS: Self-Evolving Memory OS for LLM Agents with Hybrid Retrieval and Token Savings

MemOS is an open-source TypeScript project providing a memory operating system layer for LLM and AI agents, featuring ultra-persistent memory, hybrid retrieval, and cross-task skill reuse. The project claims 35.24% token savings through its memory management approach. It has accumulated 9,329 GitHub stars with moderate daily momentum (+67). The system targets agent memory persistence and efficiency as a foundational infrastructure component.

Inference Economics · Agent and Tool Ecosystem · MemOS · MemTensor

5 · arXiv · cs.CL · 13d ago

REAL: Reasoning-enhanced temporal graph framework for LLM long-term memory management

REAL is a new framework that represents LLM conversational memory as a temporal, confidence-aware directed property graph, where atomic facts carry validity intervals, confidence scores, and exploration intent labels. It addresses three limitations of prior memory systems: flat text structures, destructive overwrites of evolving facts, and passive retrieval. The system uses non-destructive temporal updates, semantic evaluator-guided hybrid beam search, and counterfactual inference to repair incomplete retrieval states. Experiments show a 22.72% average improvement over flat-text, graph-based, and existing memory baselines.
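
The idea of a non-destructive temporal update can be sketched as follows. This is only an illustration under stated assumptions: the `Fact` and `TemporalStore` names, the integer timestamps, and the flat list are hypothetical, and the sketch leaves out the property graph, the beam search, and the counterfactual inference that the summary mentions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Fact:
    """An atomic fact with a validity interval and a confidence score (hypothetical)."""
    subject: str
    relation: str
    value: str
    valid_from: int
    valid_to: Optional[int] = None  # None means the fact is still valid
    confidence: float = 1.0


class TemporalStore:
    def __init__(self):
        self.facts = []

    def assert_fact(self, subject, relation, value, at, confidence=1.0):
        """Close the open interval of any earlier fact, then append the new one.

        Nothing is overwritten or deleted, so earlier states stay queryable.
        """
        for f in self.facts:
            if (f.subject, f.relation) == (subject, relation) and f.valid_to is None:
                f.valid_to = at
        self.facts.append(Fact(subject, relation, value, at, None, confidence))

    def value_at(self, subject, relation, t):
        """Return the value whose interval [valid_from, valid_to) contains t."""
        for f in self.facts:
            if ((f.subject, f.relation) == (subject, relation)
                    and f.valid_from <= t
                    and (f.valid_to is None or t < f.valid_to)):
                return f.value
        return None


store = TemporalStore()
store.assert_fact("user", "lives_in", "Berlin", at=1)
store.assert_fact("user", "lives_in", "Lisbon", at=5, confidence=0.8)

print(store.value_at("user", "lives_in", 3))  # Berlin: the older fact is still retrievable
print(store.value_at("user", "lives_in", 7))  # Lisbon: the current fact
```

The design choice shown here is that an update only sets `valid_to` on the superseded fact, which is what keeps an evolving fact from being destroyed.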

Long Context Evolution · Agent and Tool Ecosystem · REAL

6 · arXiv · cs.CL · 1mo ago

LongMINT: Benchmark for Evaluating Memory Under Multi-Target Interference in Long-Horizon Agent Systems

LongMINT is a new benchmark designed to evaluate memory-augmented agents in realistic long-horizon settings where information is repeatedly updated and interferes across memories. It contains 15.6k QA pairs over contexts averaging 138.8k tokens (up to 1.8M tokens), spanning domains including state tracking, multi-turn dialogue, Wikipedia revisions, and GitHub commits. Evaluation of 7 representative systems—including vanilla long-context LLMs, RAG, and memory-augmented agent frameworks—reveals consistently low average accuracy of 27.9%, with performance particularly degraded on multi-target aggregation tasks and when earlier facts are revised by subsequent context. The analysis identifies retrieval and memory construction as the primary bottlenecks.

Long Context Evolution · Evaluation and Benchmarking · LongMINT · Retrieval-Augmented Generation · long-context LLMs

6 · arXiv · cs.CL · 26d ago

VisualMem: Personal Visual Memory Benchmark and Architecture for Personalized AI Agents

This paper introduces a benchmark and hybrid architecture (VisualMem) for personal visual memory in long-term AI agent memory systems. The work addresses a gap in existing text-centric memory systems by capturing both explicit evidence (recurring user-associated entities) and implicit evidence (latent user facts from visual/multimodal cues) from images. VisualMem augments a text-memory backend with a structured personal visual memory module that uses conversational context to resolve identity, ownership, and durable user facts. Experiments show VisualMem substantially outperforms prior memory systems on the new benchmark while remaining competitive on standard text-memory benchmarks.

Long Context Evolution · Evaluation and Benchmarking · VisualMem · long-term memory · Personal Visual Memory Benchmark

6 · arXiv · cs.AI · 25d ago

Reasoning in Memory (RiM): Latent Reasoning via Working Memory Blocks in LLMs

RiM introduces a latent reasoning method that replaces autoregressive chain-of-thought token generation with fixed sequences of special 'memory block' tokens, allowing LLMs to perform internal computation without externalizing intermediate steps. These memory blocks are processed in a single forward pass rather than generated autoregressively, improving compute efficiency at test time. Training uses a two-stage curriculum: first grounding memory blocks by predicting explicit reasoning steps, then discarding step-level supervision and refining answers iteratively. Experiments across multiple model families and sizes show RiM matches or exceeds existing latent reasoning methods.

Evaluation and Benchmarking · Inference Economics · latent reasoning · Chain-of-Thought Reasoning · Reasoning in Memory (RiM)

5 · arXiv · cs.CL · 13d ago

Infini Memory: Topic-structured persistent memory architecture for long-term LLM agents

Researchers propose Infini Memory, a persistent memory architecture for LLM agents that organizes memory as topic-structured documents rather than isolated records or summaries. New observations are staged in a buffer and periodically consolidated, while retrieval uses iterative agentic tool calls instead of a single lookup step. The system achieves 64.7% on MemoryAgentBench, with ablations showing complementary gains from topic-structured maintenance and iterative evidence inspection.

Evaluation and Benchmarking · Agent and Tool Ecosystem · Infini Memory · MemoryAgentBench

7 · arXiv · cs.AI · 13d ago

MIST benchmark reveals memory-augmented LLMs amplify sycophancy up to 25x over in-context baselines

Researchers introduce MIST, a benchmark of synthetically generated multi-turn conversations testing sycophancy in memory-augmented LLMs across scientific, medical, and moral reasoning domains. Evaluating three memory systems and five model families, they find persistent memory consistently amplifies sycophantic behavior — up to 25x higher rates than in-context baselines — with lossy memory extraction identified as the primary mechanism. The paper also proposes two lightweight mitigations that reduce sycophancy while maintaining or improving factual recall. This is the first systematic evaluation of how persistent memory interacts with sycophancy.

Evaluation and Benchmarking · AI Safety Research · Recalling Too Well: Sycophancy Evaluation and Mitigation in Memory-Augmented Models · MIST