5arXiv cs.CL (Computation and Language)·8d ago

EvoArena benchmark and EvoMem memory paradigm for LLM agents in dynamic environments

Researchers introduce EvoArena, a benchmark suite that evaluates LLM agents in dynamic environments by modeling changes as progressive update sequences across terminal, software, and social domains. Alongside it, they propose EvoMem, a patch-based memory paradigm that records memory evolution as structured update histories to help agents reason about environmental change. Current agents score only 39.6% average accuracy on EvoArena, while EvoMem yields consistent gains on EvoArena and also improves performance on GAIA and LoCoMo benchmarks. The work highlights a significant gap between static-benchmark performance and real-world dynamic deployment requirements.

Evaluation and Benchmarking Agent and Tool Ecosystem EvoArena GAIA LoCoMo EvoMem

Related guides (2)

Agent and Tool EcosystemTopic guide

Agent and Tool Ecosystem: How AI Is Learning to Act, Not Just Answer

Read asBeginner In-depth

Evaluation and BenchmarkingTopic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Read asBeginner In-depth

Related events (8)

6arXiv · cs.AI·23d ago·source ↗

FluxMem: Connectivity-Evolving Memory Framework for LLM Agents

FluxMem proposes a heterogeneous graph-based memory framework for LLM agents that continuously evolves its topology through three stages: initial connection formation, feedback-driven refinement, and long-term consolidation. Unlike static memory repositories, FluxMem repairs missing links, prunes interference, aligns abstraction granularity, and distills successful trajectories into reusable procedural circuits. The system is guided by a single metric for memory generalizability and evolutionary maturity, achieving state-of-the-art results on LoCoMo, Mind2Web, and GAIA benchmarks.

Long Context Evolution Evaluation and Benchmarking heterogeneous graph memory LightMem GAIA +6 more

5arXiv · cs.AI·11d ago·source ↗

OmniGameArena: UE5 benchmark for VLM game agents with multi-round improvement dynamics

Researchers introduce OmniGameArena, a real-time benchmark of twelve Unreal Engine 5 games spanning solo, PvP, and cooperative play, designed to evaluate vision-language model agents under unified protocols across commercial VLMs, open-weight VLMs, and specialized game policies. The benchmark introduces the Improvement Dynamics Curve (IDC), an agentic-reflection harness where a tool-using LLM autonomously refines skill prompts across multiple rounds, exposing how agent performance evolves and generalizes beyond a single cold-start score. Twelve VLM agents are evaluated on the leaderboard, with four top agents further analyzed under IDC. The work addresses gaps in existing game benchmarks that report only single-attempt scores and lack multi-agent or cooperative evaluation modes.

Evaluation and Benchmarking Agent and Tool Ecosystem OmniGameArena Unreal Engine 5 Improvement Dynamics Curve

6arXiv · cs.CL·1mo ago·source ↗

LongMINT: Benchmark for Evaluating Memory Under Multi-Target Interference in Long-Horizon Agent Systems

LongMINT is a new benchmark designed to evaluate memory-augmented agents in realistic long-horizon settings where information is repeatedly updated and interferes across memories. It contains 15.6k QA pairs over contexts averaging 138.8k tokens (up to 1.8M tokens), spanning domains including state tracking, multi-turn dialogue, Wikipedia revisions, and GitHub commits. Evaluation of 7 representative systems—including vanilla long-context LLMs, RAG, and memory-augmented agent frameworks—reveals consistently low average accuracy of 27.9%, with performance particularly degraded on multi-target aggregation tasks and when earlier facts are revised by subsequent context. The analysis identifies retrieval and memory construction as the primary bottlenecks.

Long Context Evolution Evaluation and Benchmarking LongMINT Retrieval-Augmented Generation long-context LLMs +2 more

4arXiv · cs.CL·24d ago·source ↗

ENPMR-Bench: Benchmarking Proactive Memory Retrieval for Emotional Support Agents

This paper introduces ENPMR-Bench, a benchmark for evaluating Emotional Need-aware Proactive Memory Retrieval in memory-augmented language agents deployed for emotional support applications. The benchmark includes over 1,800 memory-augmented dialogues grounded in Maslow's hierarchy of needs, with structured mappings between emotional needs and supportive memory types. Experiments show that both embedding-based and LLM-driven retrieval paradigms fall significantly short of golden memory conditions on empathy scores, and while chain-of-thought prompting helps, a substantial performance gap remains. The work highlights a systematic gap in current agent memory systems when applied to affective rather than purely factual retrieval tasks.

Evaluation and Benchmarking Agent and Tool Ecosystem ENPMR-Bench chain-of-thought prompting Maslow's Hierarchy of Needs +1 more

6arXiv · cs.CL·1mo ago·source ↗

Mem-π: Adaptive Memory for LLM Agents via On-Demand Generation and Decoupled RL

Mem-π introduces a framework where a dedicated language or vision-language model generates context-specific guidance for LLM agents on demand, rather than retrieving static entries from episodic memory banks. The system is trained with a decision-content decoupled reinforcement learning objective that jointly learns when to generate guidance and what to generate, enabling abstention when generation would not help. Evaluated across web navigation, terminal-based tool use, and text-based embodied interaction benchmarks, Mem-π achieves over 30% relative improvement on web navigation tasks compared to retrieval-based and prior RL-optimized memory baselines.

Evaluation and Benchmarking Agent and Tool Ecosystem web navigation benchmark Mem-π large language model agents +3 more

6arXiv · cs.CL·15d ago·source ↗

MLEvolve: Self-evolving multi-agent framework for automated ML algorithm discovery

MLEvolve is a new LLM-based multi-agent framework for end-to-end machine learning algorithm discovery, addressing limitations of existing MLE agents including information isolation and memoryless search. The system introduces Progressive MCGS (a graph-extended tree search), Retrospective Memory for experience accumulation, and decoupled strategic planning from code generation. Evaluated on MLE-Bench, it achieves state-of-the-art medal and valid submission rates within a 12-hour budget, and also outperforms AlphaEvolve on mathematical algorithm optimization tasks.

Evaluation and Benchmarking Agent and Tool Ecosystem MLEvolve MLE-bench Progressive MCGS +3 more

5arXiv · cs.AI·10d ago·source ↗

EEVEE: Multi-dataset test-time prompt learning framework for self-improving LLM agents

EEVEE is a new framework enabling LLM agents to perform test-time prompt learning across heterogeneous multi-dataset task streams, addressing a gap where prior methods only handled single-dataset settings. The system uses a router to partition inputs into task clusters and assigns them to suitable prompt configurations, optimized via a router-prompt co-evolution strategy. Experiments show improvements of 10.38 and 24.32 average points over Qwen3-4B-Instruct and DeepSeek-V3.2 respectively, outperforming prior SOTA methods GEPA and ACE by up to 48.2%.

Evaluation and Benchmarking Agent and Tool Ecosystem ACE DeepSeek V4 Qwen3-4B-Instruct +2 more

5arXiv · cs.LG·11d ago·source ↗

Echo-Memory: Controlled study isolates memory mechanisms in action-conditioned world models

Echo-Memory is a controlled benchmark study comparing memory mechanisms in action-conditioned video world models, fixing all other variables (backbone, optimizer, evaluation) to isolate how history storage and retrieval affect scene consistency across camera departures and returns. The study compares raw context, compression-based memory, spatial summaries, and state-space recurrence under a shared video diffusion backbone. Key findings: raw context is a strong baseline for open-domain return; aggressive compression loses salient evidence; and block-wise state-space recurrence is the strongest mechanism for remembering world state across long horizons. The three-branch evaluation protocol reveals that replay fidelity is not a reliable proxy for true world memory.

Evaluation and Benchmarking Multimodal Progress Echo-Memory