5 · arXiv cs.AI (Artificial Intelligence) · 18h ago

ManimAgent: Self-evolving multimodal agent with cross-task episodic memory for code generation

ManimAgent is a multimodal agent system that accumulates reflection experience across tasks via a dual-channel Episodic Memory Bank, without weight updates or human-curated seeds. The agent generates Python/Manim animations from scientific paper sections, and a vision-language model scores rendered keyframes to populate positive (success rationales) and negative (failure patterns) memory channels. On a fixed-probe evaluation, Pass@1 improves and reflection rounds decrease as memory grows, outperforming no-memory, RAG, and shuffled-memory baselines. The work addresses a known limitation of single-episode reflection in LLM agents by enabling persistent, self-generated learning across task boundaries.
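The dual-channel memory described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the class and method names, the score range of 0 to 1, the 0.7 threshold, and the "Do / Avoid" prompt layout are all assumptions made for illustration, and the real system's retrieval and scoring logic is not specified in this summary.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MemoryEntry:
    task_id: str
    note: str     # success rationale or failure pattern, written by the agent
    score: float  # hypothetical VLM keyframe score, assumed to lie in [0, 1]


@dataclass
class EpisodicMemoryBank:
    """Hypothetical two-channel store; all names here are illustrative."""
    threshold: float = 0.7  # assumed cutoff between success and failure
    positive: List[MemoryEntry] = field(default_factory=list)
    negative: List[MemoryEntry] = field(default_factory=list)

    def record(self, task_id: str, note: str, score: float) -> str:
        """Route a reflection into a channel based on the keyframe score."""
        entry = MemoryEntry(task_id, note, score)
        if score >= self.threshold:
            self.positive.append(entry)
            return "positive"
        self.negative.append(entry)
        return "negative"

    def build_context(self, k: int = 2) -> str:
        """Format the k strongest successes and k worst failures as prompt text."""
        best = sorted(self.positive, key=lambda e: e.score, reverse=True)[:k]
        worst = sorted(self.negative, key=lambda e: e.score)[:k]
        lines = ["Do:"] + [f"- {e.note}" for e in best]
        lines += ["Avoid:"] + [f"- {e.note}" for e in worst]
        return "\n".join(lines)


if __name__ == "__main__":
    bank = EpisodicMemoryBank()
    bank.record("task-1", "Group labels with the axes before animating", 0.9)
    bank.record("task-2", "Title text overlapped the plot area", 0.2)
    print(bank.build_context())
```

Because the bank lives outside the model, entries recorded on one task can be injected into the prompt for a later task, which is the sense in which learning persists without weight updates.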

Evaluation and Benchmarking · Agent and Tool Ecosystem · ManimAgent · Manim

Related guides (2)

Agent and Tool Ecosystem · Topic guide

Agent and Tool Ecosystem: How AI Is Learning to Act, Not Just Answer


Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder


Related events (8)

6 · arXiv · cs.CL · 1mo ago

Mem-π: Adaptive Memory for LLM Agents via On-Demand Generation and Decoupled RL

Mem-π introduces a framework where a dedicated language or vision-language model generates context-specific guidance for LLM agents on demand, rather than retrieving static entries from episodic memory banks. The system is trained with a decision-content decoupled reinforcement learning objective that jointly learns when to generate guidance and what to generate, enabling abstention when generation would not help. Evaluated across web navigation, terminal-based tool use, and text-based embodied interaction benchmarks, Mem-π achieves over 30% relative improvement on web navigation tasks compared to retrieval-based and prior RL-optimized memory baselines.

Evaluation and Benchmarking · Agent and Tool Ecosystem · web navigation benchmark · Mem-π · large language model agents · +3 more

4 · GitHub Trending · 1mo ago

agentmemory: Persistent Memory for AI Coding Agents

agentmemory is an open-source TypeScript library that provides persistent memory for AI coding agents; its description says the design is based on real-world benchmarks. The repository has accumulated 13,772 stars, including a single-day gain of 1,626, indicating strong community traction. It targets the agent tool ecosystem by addressing memory continuity across coding agent sessions.

Agent and Tool Ecosystem · agentmemory · rohitg00

6 · arXiv · cs.CL · 1mo ago

VisualMem: Personal Visual Memory Benchmark and Architecture for Personalized AI Agents

This paper introduces a benchmark and hybrid architecture (VisualMem) for personal visual memory in long-term AI agent memory systems. The work addresses a gap in existing text-centric memory systems by capturing both explicit evidence (recurring user-associated entities) and implicit evidence (latent user facts from visual/multimodal cues) from images. VisualMem augments a text-memory backend with a structured personal visual memory module that uses conversational context to resolve identity, ownership, and durable user facts. Experiments show VisualMem substantially outperforms prior memory systems on the new benchmark while remaining competitive on standard text-memory benchmarks.

Long Context Evolution · Evaluation and Benchmarking · VisualMem · long-term memory · Personal Visual Memory Benchmark · +3 more

5 · GitHub Trending · 1mo ago

claude-mem: Persistent Cross-Session Memory Layer for AI Coding Agents

claude-mem is an open-source TypeScript library that provides persistent context across sessions for AI coding agents. It captures agent activity during sessions, compresses it using AI, and injects relevant context into future sessions. The tool claims compatibility with Claude Code, OpenAI Codex, Gemini, GitHub Copilot, and other coding agents. The repository has accumulated 78,579 stars with 319 added today, indicating strong community traction.

Long Context Evolution · Agent and Tool Ecosystem · Claude Code · claude-mem · thedotmack · +2 more

6 · arXiv · cs.AI · 22d ago

MemDreamer: Hierarchical graph memory and agentic retrieval for long video understanding

MemDreamer is a plug-and-play framework that decouples perception and reasoning for long-video understanding by incrementally building a three-tier Hierarchical Graph Memory capturing spatiotemporal and causal relations. During inference, a reasoning model uses an Observation-Reason-Action loop with agentic tool-augmented retrieval to navigate the memory graph, constraining the context window to 2% of full-context ingestion while achieving a 12.5-point absolute accuracy gain. The system reaches SOTA on four benchmarks, narrowing the gap with human experts to 3.7 points. The authors also report a strong linear correlation between logical reasoning performance and long-video understanding, proposing agentic capability scaling as a new paradigm for multimodal comprehension.

Long Context Evolution · Agent and Tool Ecosystem · MemDreamer · Hierarchical Graph Memory · Observation-Reason-Action · +1 more

6 · arXiv · cs.CL · 12h ago

WorldEvolver: Self-Evolving World Models for LLM Agent Planning via Test-Time Memory Revision

Researchers introduce WorldEvolver, a framework that equips LLM agents with self-improving world models that revise their context at deployment time without updating model parameters. The system combines episodic memory (retrieval-based simulation), semantic memory (heuristic rule extraction from prediction errors), and selective foresight (confidence-based filtering). Evaluated on ALFWorld and ScienceWorld benchmarks, WorldEvolver achieves state-of-the-art world model prediction accuracy and improved downstream agent success rates across three backbone models. The work addresses a key challenge in long-horizon agent planning: unreliable foresight that can degrade rather than improve decision-making.

Evaluation and Benchmarking · Agent and Tool Ecosystem · ALFWorld · AgentBoard · Word2World · +2 more

4 · GitHub Trending · 1mo ago

MemOS: Self-Evolving Memory OS for LLM Agents with Hybrid Retrieval and Token Savings

MemOS is an open-source TypeScript project that provides a memory operating system layer for LLM and AI agents, featuring persistent memory, hybrid retrieval, and cross-task skill reuse. The project claims 35.24% token savings through its memory management approach. It has accumulated 9,329 GitHub stars with moderate daily momentum (+67). The system targets agent memory persistence and efficiency as a foundational infrastructure component.

Inference Economics · Agent and Tool Ecosystem · MemOS · MemTensor

6 · arXiv · cs.AI · 1mo ago

FluxMem: Connectivity-Evolving Memory Framework for LLM Agents

FluxMem proposes a heterogeneous graph-based memory framework for LLM agents that continuously evolves its topology through three stages: initial connection formation, feedback-driven refinement, and long-term consolidation. Unlike static memory repositories, FluxMem repairs missing links, prunes interference, aligns abstraction granularity, and distills successful trajectories into reusable procedural circuits. The system is guided by a single metric for memory generalizability and evolutionary maturity, achieving state-of-the-art results on LoCoMo, Mind2Web, and GAIA benchmarks.

Long Context Evolution · Evaluation and Benchmarking · heterogeneous graph memory · LightMem · GAIA · +6 more