7 · arXiv cs.CL (Computation and Language) · 36h ago

AutoMem: Automated framework trains LLMs to manage memory as a learnable cognitive skill

AutoMem is a new framework that treats memory management in LLMs as a trainable skill. It uses two optimization loops: one in which a strong LLM reviews agent trajectories and iteratively revises the memory structure, and one that distills good memory decisions into a direct training signal for the agent model. Evaluated on three long-horizon, procedurally generated games (Crafter, MiniHack, NetHack), optimizing memory alone yielded 2x to 4x performance improvements, making a 32B open-weight model competitive with frontier systems such as Claude Opus 4.5 and Gemini 3.1 Pro Thinking. The work draws on cognitive science concepts of metamemory and demonstrates that memory management is an independently learnable, high-leverage capability for long-horizon agentic tasks.

Long Context Evolution · Open Weights Progress · Agent and Tool Ecosystem · Gemini 3.1 Pro · Claude Opus 4.6 · NetHack · Crafter · MiniHack · Gemini 3.1 Pro Thinking · AutoMem

Related guides (3)

Long Context Evolution · Topic guide

Long Context Evolution: From Bigger Windows to Smarter Memory

Claude Opus 4.6

Claude Opus 4.6: Anthropic's Leap into Million-Token, Agentic AI

Open Weights Progress · Topic guide

Open Weights Progress: How Free AI Models Caught Up to the Frontier

Related events (8)

6 · arXiv · cs.CL · 1mo ago

Mem-π: Adaptive Memory for LLM Agents via On-Demand Generation and Decoupled RL

Mem-π introduces a framework where a dedicated language or vision-language model generates context-specific guidance for LLM agents on demand, rather than retrieving static entries from episodic memory banks. The system is trained with a decision-content decoupled reinforcement learning objective that jointly learns when to generate guidance and what to generate, enabling abstention when generation would not help. Evaluated across web navigation, terminal-based tool use, and text-based embodied interaction benchmarks, Mem-π achieves over 30% relative improvement on web navigation tasks compared to retrieval-based and prior RL-optimized memory baselines.

Evaluation and Benchmarking · Agent and Tool Ecosystem · web navigation benchmark · Mem-π · large language model agents

4 · GitHub Trending · 1mo ago

MemOS: Self-Evolving Memory OS for LLM Agents with Hybrid Retrieval and Token Savings

MemOS is an open-source TypeScript project providing a memory operating system layer for LLM and AI agents, featuring ultra-persistent memory, hybrid retrieval, and cross-task skill reuse. The project claims 35.24% token savings through its memory management approach. It has accumulated 9,329 GitHub stars with moderate daily momentum (+67). The system targets agent memory persistence and efficiency as a foundational infrastructure component.

Inference Economics · Agent and Tool Ecosystem · MemOS · MemTensor

6 · arXiv · cs.AI · 1mo ago

FluxMem: Connectivity-Evolving Memory Framework for LLM Agents

FluxMem proposes a heterogeneous graph-based memory framework for LLM agents that continuously evolves its topology through three stages: initial connection formation, feedback-driven refinement, and long-term consolidation. Unlike static memory repositories, FluxMem repairs missing links, prunes interference, aligns abstraction granularity, and distills successful trajectories into reusable procedural circuits. The system is guided by a single metric for memory generalizability and evolutionary maturity, achieving state-of-the-art results on LoCoMo, Mind2Web, and GAIA benchmarks.

Long Context Evolution · Evaluation and Benchmarking · heterogeneous graph memory · LightMem · GAIA

5 · arXiv · cs.AI · 3d ago

ManimAgent: Self-evolving multimodal agent with cross-task episodic memory for code generation

ManimAgent is a multimodal agent system that accumulates reflection experience across tasks via a dual-channel Episodic Memory Bank, without weight updates or human-curated seeds. The agent generates Python/Manim animations from scientific paper sections, and a vision-language model scores rendered keyframes to populate positive (success rationales) and negative (failure patterns) memory channels. On a fixed-probe evaluation, Pass@1 improves and reflection rounds decrease as memory grows, outperforming no-memory, RAG, and shuffled-memory baselines. The work addresses a known limitation of single-episode reflection in LLM agents by enabling persistent, self-generated learning across task boundaries.

Evaluation and Benchmarking · Agent and Tool Ecosystem · ManimAgent · Manim

5 · arXiv · cs.AI · 18h ago

AgenticSTS: Bounded-memory testbed for studying long-horizon LLM agent decisions in Slay the Spire 2

Researchers introduce AgenticSTS, a testbed for studying long-horizon LLM agents under a bounded-memory contract where each decision is assembled from typed retrieval rather than appending a raw cross-decision transcript. The system is instantiated in Slay the Spire 2, a stochastic deck-building game requiring hundreds of sequential decisions, chosen because frontier LLMs currently win zero games at the lowest difficulty against a 16% human baseline. Ablation experiments show enabling a strategic skill layer improves win rate from 3/10 to 6/10, though sample sizes are too small for statistical significance. The authors release 298 completed trajectories, memory snapshots, and analysis scripts as a reusable methodology for isolating how explicit memory layers affect agent performance.

Evaluation and Benchmarking · Agent and Tool Ecosystem · AgenticSTS · Slay the Spire 2
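
The AgenticSTS summary's caveat about sample size can be checked with a quick calculation. The sketch below applies a two-sided Fisher's exact test to the reported 3/10 and 6/10 win counts. The choice of test is an assumption made here for illustration; the summary does not say which test, if any, the authors used.

```python
from math import comb

def fisher_exact_two_sided(wins_a, n_a, wins_b, n_b):
    """Two-sided Fisher's exact test for a 2x2 table of wins and losses."""
    total_wins = wins_a + wins_b
    denom = comb(n_a + n_b, total_wins)

    def weight(k):
        # Number of ways arm A ends up with exactly k of the total wins.
        return comb(n_a, k) * comb(n_b, total_wins - k)

    observed = weight(wins_a)
    lowest = max(0, total_wins - n_b)
    highest = min(n_a, total_wins)
    # Sum every possible table that is no more likely than the observed one.
    tail = sum(weight(k) for k in range(lowest, highest + 1)
               if weight(k) <= observed)
    return tail / denom

# Win counts from the summary: 3/10 without the skill layer, 6/10 with it.
p_value = fisher_exact_two_sided(3, 10, 6, 10)
print(f"p = {p_value:.3f}")  # prints "p = 0.370"
```

Under this test the p-value is about 0.37, well above the conventional 0.05 threshold, which is consistent with the caveat that ten games per arm cannot establish the improvement.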

5 · arXiv · cs.CL · 23d ago

Infini Memory: Topic-structured persistent memory architecture for long-term LLM agents

Researchers propose Infini Memory, a persistent memory architecture for LLM agents that organizes memory as topic-structured documents rather than isolated records or summaries. New observations are staged in a buffer and periodically consolidated, while retrieval uses iterative agentic tool calls instead of a single lookup step. The system achieves 64.7% on MemoryAgentBench, with ablations showing complementary gains from topic-structured maintenance and iterative evidence inspection.

Evaluation and Benchmarking · Agent and Tool Ecosystem · Infini Memory · MemoryAgentBench
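
As a rough illustration of the staging-then-consolidation idea in the Infini Memory summary, here is a minimal sketch. Everything in it is hypothetical: the class name `TopicMemory`, the size-based flush trigger, and the caller-supplied `topic_of` function are assumptions for illustration, not the paper's design. The iterative agentic retrieval step is not modeled.

```python
from collections import defaultdict

class TopicMemory:
    """Hypothetical sketch: stage observations, then fold them into topic documents."""

    def __init__(self, topic_of, flush_every=3):
        self.topic_of = topic_of        # assumed helper: observation text -> topic name
        self.flush_every = flush_every  # assumed trigger: consolidate at this buffer size
        self.buffer = []                # staged, not yet consolidated observations
        self.documents = defaultdict(list)

    def observe(self, text):
        self.buffer.append(text)
        if len(self.buffer) >= self.flush_every:
            self.consolidate()

    def consolidate(self):
        # Move each staged observation into the document for its topic.
        for text in self.buffer:
            self.documents[self.topic_of(text)].append(text)
        self.buffer.clear()

    def read(self, topic):
        return "\n".join(self.documents.get(topic, []))

def first_word_topic(text):
    # Toy topic assignment: the first word, lowercased.
    return text.split()[0].lower()

memory = TopicMemory(first_word_topic, flush_every=2)
memory.observe("Alice prefers tea")
memory.observe("Bob lives in Oslo")    # buffer is full, so consolidation runs
memory.observe("Alice moved to Lyon")  # stays staged until the next consolidation
staged_before = list(memory.buffer)
memory.consolidate()
print(memory.read("alice"))
```

A real system would presumably use a model to assign topics and rewrite documents during consolidation; the toy version only appends, which keeps the control flow visible.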

5 · arXiv · cs.CL · 10d ago

IMLogic benchmark and RootMem framework target implicit logical memory retrieval for personalized LLMs

Researchers introduce IMLogic, a benchmark for evaluating implicit logical memory retrieval in long-dialogue personalized LLM scenarios, addressing gaps in existing semantic-similarity-based retrieval methods. They also propose RootMem, a plug-and-play framework that distills user histories into structured 'root memories' and uses an LLM-based router to activate logically relevant memories alongside semantic retrieval. Experiments show RootMem outperforms retrieval baselines and improves existing memory agents. The work targets a concrete weakness in current personalized LLM memory systems where logically critical memories lack semantic overlap with queries.

Evaluation and Benchmarking · Agent and Tool Ecosystem · Towards Root Memories: Benchmarking and Enhancing Implicit Logical Memory Retrieval for Personalized LLMs · IMLogic · RootMem

5 · arXiv · cs.CL · 21d ago

EvoArena benchmark and EvoMem memory paradigm for LLM agents in dynamic environments

Researchers introduce EvoArena, a benchmark suite that evaluates LLM agents in dynamic environments by modeling changes as progressive update sequences across terminal, software, and social domains. Alongside it, they propose EvoMem, a patch-based memory paradigm that records memory evolution as structured update histories to help agents reason about environmental change. Current agents score only 39.6% average accuracy on EvoArena, while EvoMem yields consistent gains on EvoArena and also improves performance on GAIA and LoCoMo benchmarks. The work highlights a significant gap between static-benchmark performance and real-world dynamic deployment requirements.

Evaluation and Benchmarking · Agent and Tool Ecosystem · EvoArena · GAIA · LoCoMo
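
To make the "structured update histories" idea in the EvoMem summary concrete, the sketch below stores memory as a list of patches rather than overwriting values in place. The names `Patch`, `PatchMemory`, `update`, `history`, and `as_of` are hypothetical, and the patch fields are assumptions; the summary does not describe EvoMem's actual data model.

```python
from dataclasses import dataclass, field

@dataclass
class Patch:
    step: int    # when the change was observed
    key: str     # which fact changed
    old: object  # value before the change (None if the fact is new)
    new: object  # value after the change

@dataclass
class PatchMemory:
    """Hypothetical sketch: keep the current state plus the patches that produced it."""
    state: dict = field(default_factory=dict)
    patches: list = field(default_factory=list)

    def update(self, step, key, value):
        old = self.state.get(key)
        if old != value:  # record a patch only when something actually changed
            self.patches.append(Patch(step, key, old, value))
            self.state[key] = value

    def history(self, key):
        return [(p.step, p.old, p.new) for p in self.patches if p.key == key]

    def as_of(self, step):
        # Replay patches up to `step`; assumes updates arrive in step order.
        snapshot = {}
        for p in self.patches:
            if p.step <= step:
                snapshot[p.key] = p.new
        return snapshot

memory = PatchMemory()
memory.update(1, "python_version", "3.10")
memory.update(5, "python_version", "3.12")
memory.update(7, "python_version", "3.12")  # unchanged, so no patch is recorded
print(memory.history("python_version"))
print(memory.as_of(3))
```

Keeping the old value in each patch lets an agent answer both "what is true now" and "what changed, and when", which is the kind of reasoning about environmental change the summary describes.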