AutoMem: Automated framework trains LLMs to manage memory as a learnable cognitive skill
AutoMem is a new framework that treats memory management in LLMs as a trainable skill. It uses two optimization loops: one iteratively revises the memory structure by having a strong LLM review agent trajectories, and the other distills good memory decisions into a direct training signal for the agent model. Evaluated on three long-horizon, procedurally generated games (Crafter, MiniHack, NetHack), optimizing memory alone yielded 2x to 4x performance improvements, making a 32B open-weight model competitive with frontier systems such as Claude Opus 4.5 and Gemini 3.1 Pro Thinking. The work draws on the cognitive science concept of metamemory and demonstrates that memory management is an independently learnable, high-leverage capability for long-horizon agentic tasks.
Related events (8)
Mem-π: Adaptive Memory for LLM Agents via On-Demand Generation and Decoupled RL
Mem-π introduces a framework where a dedicated language or vision-language model generates context-specific guidance for LLM agents on demand, rather than retrieving static entries from episodic memory banks. The system is trained with a decision-content decoupled reinforcement learning objective that jointly learns when to generate guidance and what to generate, enabling abstention when generation would not help. Evaluated across web navigation, terminal-based tool use, and text-based embodied interaction benchmarks, Mem-π achieves over 30% relative improvement on web navigation tasks compared to retrieval-based and prior RL-optimized memory baselines.
MemOS: Self-Evolving Memory OS for LLM Agents with Hybrid Retrieval and Token Savings
MemOS is an open-source TypeScript project that provides a memory operating system layer for LLM and AI agents, featuring ultra-persistent memory, hybrid retrieval, and cross-task skill reuse. The project claims 35.24% token savings from its memory management approach. It has accumulated 9,329 GitHub stars, with moderate daily momentum (+67). The system treats agent memory persistence and efficiency as a foundational infrastructure component.
FluxMem: Connectivity-Evolving Memory Framework for LLM Agents
FluxMem proposes a heterogeneous graph-based memory framework for LLM agents that continuously evolves its topology through three stages: initial connection formation, feedback-driven refinement, and long-term consolidation. Unlike static memory repositories, FluxMem repairs missing links, prunes interference, aligns abstraction granularity, and distills successful trajectories into reusable procedural circuits. The system is guided by a single metric for memory generalizability and evolutionary maturity, achieving state-of-the-art results on LoCoMo, Mind2Web, and GAIA benchmarks.
ManimAgent: Self-evolving multimodal agent with cross-task episodic memory for code generation
ManimAgent is a multimodal agent system that accumulates reflection experience across tasks via a dual-channel Episodic Memory Bank, without weight updates or human-curated seeds. The agent generates Python/Manim animations from scientific paper sections, and a vision-language model scores rendered keyframes to populate positive (success rationales) and negative (failure patterns) memory channels. On a fixed-probe evaluation, Pass@1 improves and reflection rounds decrease as memory grows, outperforming no-memory, RAG, and shuffled-memory baselines. The work addresses a known limitation of single-episode reflection in LLM agents by enabling persistent, self-generated learning across task boundaries.
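The dual-channel idea can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the class name, the 0.5 score threshold, the `record` and `as_prompt` helpers, and the example notes are hypothetical and not taken from ManimAgent itself. The sketch only shows how a scalar score from a judge model could route a reflection note into a positive or a negative channel, and how both channels could be folded back into a later prompt.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class EpisodicMemoryBank:
    """Hypothetical two-channel store: success rationales vs. failure patterns."""
    threshold: float = 0.5  # assumed cut-off on the judge score
    positive: List[str] = field(default_factory=list)
    negative: List[str] = field(default_factory=list)

    def record(self, score: float, note: str) -> None:
        # Route the note by score: high scores are kept as success rationales,
        # low scores as failure patterns to avoid.
        if score >= self.threshold:
            self.positive.append(note)
        else:
            self.negative.append(note)

    def as_prompt(self, k: int = 2) -> str:
        # Expose the k most recent notes from each channel as prompt text.
        lines = ["What worked before:"]
        lines += [f"- {n}" for n in self.positive[-k:]]
        lines.append("What to avoid:")
        lines += [f"- {n}" for n in self.negative[-k:]]
        return "\n".join(lines)


bank = EpisodicMemoryBank()
bank.record(0.9, "Group related equations before animating them")
bank.record(0.2, "Text overlapped the axes labels")
print(bank.as_prompt())
```

Keeping the two channels separate lets a later task see both what to imitate and what to avoid, rather than a single undifferentiated log.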
AgenticSTS: Bounded-memory testbed for studying long-horizon LLM agent decisions in Slay the Spire 2
Researchers introduce AgenticSTS, a testbed for studying long-horizon LLM agents under a bounded-memory contract where each decision is assembled from typed retrieval rather than by appending a raw cross-decision transcript. The testbed is instantiated in Slay the Spire 2, a stochastic deck-building game that requires hundreds of sequential decisions; the game was chosen because frontier LLMs currently win zero games at the lowest difficulty, against a 16% human baseline. Ablation experiments show that enabling a strategic skill layer improves the win rate from 3/10 to 6/10, though the sample sizes are too small for statistical significance. The authors release 298 completed trajectories, memory snapshots, and analysis scripts as a reusable methodology for isolating how explicit memory layers affect agent performance.
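The caveat about sample size can be checked with a quick calculation. The sketch below applies a two-sided Fisher exact test to the reported 3/10 and 6/10 win counts. The choice of test is an assumption made for illustration; the summary does not say which test, if any, the authors used. The function name is hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1 = a + b       # games played in the first condition
    col1 = a + c       # total wins across both conditions
    n = a + b + c + d  # total games

    def weight(k: int) -> int:
        # Number of ways to place k of the col1 wins in the first row.
        return comb(col1, k) * comb(n - col1, row1 - k)

    lo = max(0, row1 - (n - col1))
    hi = min(col1, row1)
    observed = weight(a)
    # Sum every table that is at most as likely as the observed one.
    extreme = sum(weight(k) for k in range(lo, hi + 1) if weight(k) <= observed)
    return extreme / comb(n, row1)


# Wins and losses without and with the strategic skill layer (3/10 vs 6/10).
p_value = fisher_exact_two_sided(3, 7, 6, 4)
print(f"p = {p_value:.3f}")  # p = 0.370
```

A p-value of about 0.37 is far above the conventional 0.05 threshold, which is consistent with the statement that ten games per condition cannot establish significance.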
Infini Memory: Topic-structured persistent memory architecture for long-term LLM agents
Researchers propose Infini Memory, a persistent memory architecture for LLM agents that organizes memory as topic-structured documents rather than isolated records or summaries. New observations are staged in a buffer and periodically consolidated, while retrieval uses iterative agentic tool calls instead of a single lookup step. The system achieves 64.7% on MemoryAgentBench, with ablations showing complementary gains from topic-structured maintenance and iterative evidence inspection.
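A minimal sketch of the stage-then-consolidate pattern described above follows. It is not Infini Memory's implementation: the class, the `buffer_limit` parameter, and the toy `first_word_topic` function are hypothetical stand-ins, and a real system would presumably use a model rather than the first word to assign topics. Iterative agentic retrieval is not shown.

```python
from collections import defaultdict
from typing import Callable, Dict, List


class TopicMemory:
    """Hypothetical memory that stages observations, then files them by topic."""

    def __init__(self, assign_topic: Callable[[str], str], buffer_limit: int = 3):
        self.assign_topic = assign_topic  # assumed topic classifier
        self.buffer_limit = buffer_limit  # consolidate once this many are staged
        self.buffer: List[str] = []
        self.documents: Dict[str, List[str]] = defaultdict(list)

    def observe(self, text: str) -> None:
        # New observations are staged first, not written to a document directly.
        self.buffer.append(text)
        if len(self.buffer) >= self.buffer_limit:
            self.consolidate()

    def consolidate(self) -> None:
        # Move every staged observation into the document for its topic.
        for text in self.buffer:
            self.documents[self.assign_topic(text)].append(text)
        self.buffer.clear()


def first_word_topic(text: str) -> str:
    # Toy topic assignment for the example only.
    return text.split()[0].lower()


memory = TopicMemory(first_word_topic, buffer_limit=3)
for note in [
    "Billing plan changed to annual",
    "Travel booked for Lisbon",
    "Billing address updated",
]:
    memory.observe(note)
memory.observe("Travel visa pending")  # stays staged until the next consolidation
```

After the third note the buffer is consolidated into a "billing" document with two entries and a "travel" document with one, while the fourth note waits in the buffer.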
IMLogic benchmark and RootMem framework target implicit logical memory retrieval for personalized LLMs
Researchers introduce IMLogic, a benchmark for evaluating implicit logical memory retrieval in long-dialogue personalized LLM scenarios, addressing gaps in existing semantic-similarity-based retrieval methods. They also propose RootMem, a plug-and-play framework that distills user histories into structured 'root memories' and uses an LLM-based router to activate logically relevant memories alongside semantic retrieval. Experiments show RootMem outperforms retrieval baselines and improves existing memory agents. The work targets a concrete weakness in current personalized LLM memory systems where logically critical memories lack semantic overlap with queries.
EvoArena benchmark and EvoMem memory paradigm for LLM agents in dynamic environments
Researchers introduce EvoArena, a benchmark suite that evaluates LLM agents in dynamic environments by modeling changes as progressive update sequences across terminal, software, and social domains. Alongside it, they propose EvoMem, a patch-based memory paradigm that records memory evolution as structured update histories to help agents reason about environmental change. Current agents score only 39.6% average accuracy on EvoArena, while EvoMem yields consistent gains on EvoArena and also improves performance on GAIA and LoCoMo benchmarks. The work highlights a significant gap between static-benchmark performance and real-world dynamic deployment requirements.
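One way to picture a patch-based memory is as a log of structured updates from which any earlier state can be replayed. The sketch below is an assumption-laden illustration, not EvoMem's actual data model: the `Patch` fields, the `PatchMemory` class, and the example keys are all hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass(frozen=True)
class Patch:
    """One recorded change: at `step`, `key` went from `old` to `new`."""
    step: int
    key: str
    old: Optional[Any]
    new: Any


class PatchMemory:
    """Hypothetical memory that keeps the update history, not just the latest value."""

    def __init__(self) -> None:
        self.state: Dict[str, Any] = {}
        self.patches: List[Patch] = []

    def update(self, step: int, key: str, value: Any) -> None:
        old = self.state.get(key)
        if old == value:
            return  # nothing changed, so no patch is recorded
        self.patches.append(Patch(step, key, old, value))
        self.state[key] = value

    def history(self, key: str) -> List[Patch]:
        # Every recorded change to one key, in the order it happened.
        return [p for p in self.patches if p.key == key]

    def state_at(self, step: int) -> Dict[str, Any]:
        # Replay patches up to `step`; assumes updates arrive in step order.
        snapshot: Dict[str, Any] = {}
        for p in self.patches:
            if p.step <= step:
                snapshot[p.key] = p.new
        return snapshot


mem = PatchMemory()
mem.update(1, "python_version", "3.10")
mem.update(5, "python_version", "3.12")
mem.update(5, "package_manager", "uv")
print(mem.state_at(3))
print(mem.history("python_version"))
```

Because old values are retained in the patches, an agent can answer both "what is true now" and "what changed, and when", which a memory that overwrites in place cannot do.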


