6 · arXiv cs.AI (Artificial Intelligence) · 20h ago

Survey of persistent memory, state, and governance in always-on LLM agents introduces AOEP-v0 evaluation protocol

A new arXiv survey paper examines 'always-on agents' — LLM-based systems whose behavior depends on durable state accumulated across interactions — through six diagnostic axes covering authority, scope, mutability, provenance, recoverability, and actionability. The authors analyze a 435-work corpus and find the literature over-indexes on state accumulation and retrieval while under-serving governance, recovery, and relinquishment. To address this gap, they introduce the Always-On Evaluation Protocol (AOEP-v0), a pilot evaluation contract that scores state mutation and recovery obligations rather than answer quality. The work connects agent design to databases, distributed systems, formal methods, capability security, and machine unlearning.
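To make the six axes and the idea of scoring recovery obligations concrete, here is a minimal Python sketch. It is not the paper's protocol: `StateRecord`, `StateStore`, and `check_forget_obligation` are hypothetical names, and the single "forget" check is an assumed stand-in for the kind of obligation AOEP-v0 is described as scoring.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple

# The six diagnostic axes named in the summary above.
AXES = ("authority", "scope", "mutability",
        "provenance", "recoverability", "actionability")


@dataclass
class StateRecord:
    """One piece of durable agent state (hypothetical structure)."""
    key: str
    value: str
    # Free-text annotation per axis; the values used here are illustrative only.
    axes: Dict[str, str] = field(default_factory=dict)

    def missing_axes(self) -> List[str]:
        """Axes that have no annotation on this record, in AXES order."""
        return [a for a in AXES if a not in self.axes]


class StateStore:
    """Toy durable store with an append-only mutation log."""

    def __init__(self) -> None:
        self._records: Dict[str, StateRecord] = {}
        self._log: List[Tuple[str, str]] = []

    def write(self, record: StateRecord) -> None:
        self._records[record.key] = record
        self._log.append(("write", record.key))

    def forget(self, key: str) -> bool:
        """Remove a record and log the request; returns whether it existed."""
        existed = self._records.pop(key, None) is not None
        self._log.append(("forget", key))
        return existed

    def keys(self) -> Set[str]:
        return set(self._records)

    def log(self) -> List[Tuple[str, str]]:
        return list(self._log)


def check_forget_obligation(store: StateStore, key: str) -> bool:
    """Assumed obligation: the key is gone AND the removal was logged."""
    return key not in store.keys() and ("forget", key) in store.log()
```

The check inspects the state and its mutation log, not any answer the agent produced, which is the distinction the summary draws between scoring state obligations and scoring answer quality.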

Evaluation and Benchmarking · AI Safety Research · Agent and Tool Ecosystem · Always-On Agents: A Survey of Persistent Memory, State, and Governance in LLM Agents · Always-On Evaluation Protocol (AOEP-v0)

Related guides (3)

AI Safety Research · Topic guide

AI Safety Research: From Lab Policies to Real-World Flashpoints


Agent and Tool Ecosystem · Topic guide

Agent and Tool Ecosystem: How AI Is Learning to Act, Not Just Answer


Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder


Related events (8)

6 · arXiv · cs.CL · 6d ago

Systematic evaluation of 12 agent memory systems from a data management perspective

A new arXiv preprint proposes an analytical framework decomposing agent memory into four core modules—representation/storage, extraction, retrieval/routing, and maintenance—and evaluates 12 representative memory systems across five benchmark workloads spanning 11 datasets. The study finds no single architecture dominates across scenarios; effectiveness depends on alignment between memory structure and workload bottleneck. Fine-grained ablation studies quantify effects on retrieval precision, update correctness, and long-horizon stability, and reveal that localized maintenance is more cost-efficient than global reorganization. Code is publicly released.

Long Context Evolution · Evaluation and Benchmarking · OpenDataBox · Are We Ready For An Agent-Native Memory System?

5 · arXiv · cs.CL · 19d ago

Survey: Agentic Environment Engineering for LLMs — Modeling, Synthesis, Evaluation, and Application

A comprehensive arXiv survey systematically reviews the design and engineering of interactive environments for LLM-based agents, covering the full lifecycle from environment modeling and synthesis to evaluation and application. The paper categorizes environments across eight attributes and eight domains, introduces symbolic and neural synthesis paradigms, and characterizes four pathways for agent-environment co-evolution including memory-centric, orchestration-centric, trajectory-centric, and exploration-centric approaches. It also identifies three paradigms of environment evolution (neural-driven, difficulty-driven, scaling-driven) and proposes future directions such as Environment-as-a-Service and multi-agent environments. This is a reference-organizing contribution for the rapidly growing agent tooling and evaluation space.

Evaluation and Benchmarking · Agent and Tool Ecosystem · Agentic Environment Engineering for Large Language Models: A Survey of Environment Modeling, Synthesis, Evaluation, and Application

5 · arXiv · cs.CL · 6d ago

MEMPROBE: Benchmark for auditing long-term agent memory via hidden user-state recovery

MEMPROBE is a new benchmark that evaluates long-term memory in LLM agents by treating memory as an auditable artifact rather than measuring it only through downstream task performance. After a memory-equipped agent assists simulated users across a trajectory of tasks, the benchmark attempts to reconstruct a hidden, taxonomy-anchored user-state bank from the agent's memory store. Testing across 5 memory systems and 50 simulated users with 31 hidden dimensions each, the authors find that task completion and memory recovery are largely independent capabilities — task success nearly saturates even for memoryless baselines, while structured user-state recovery remains moderate (~0.6) and degrades under top-k retrieval constraints.
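The effect of a top-k retrieval constraint on state recovery can be illustrated with a toy probe. This is a sketch under loud assumptions, not the MEMPROBE implementation: the memory store is a plain list of strings, retrieval is naive word overlap, and `retrieve` and `recovery_score` are hypothetical helpers.

```python
from typing import Dict, List, Optional


def retrieve(memory: List[str], query: str, top_k: Optional[int] = None) -> List[str]:
    """Rank entries by word overlap with the query (assumed toy retriever).

    Python's sort is stable, so entries with equal overlap keep store order.
    """
    q = set(query.lower().split())
    ranked = sorted(memory,
                    key=lambda text: len(q & set(text.lower().split())),
                    reverse=True)
    return ranked if top_k is None else ranked[:top_k]


def recovery_score(memory: List[str],
                   hidden_state: Dict[str, str],
                   top_k: Optional[int] = None) -> float:
    """Fraction of hidden dimensions whose value appears in retrieved entries."""
    hits = 0
    for dim, value in hidden_state.items():
        context = " ".join(retrieve(memory, dim, top_k)).lower()
        if value.lower() in context:
            hits += 1
    return hits / len(hidden_state)


# Illustrative data only: a distractor entry competes for the "diet" query.
memory = [
    "user asked about a diet plan",
    "user diet is vegan",
    "user city is lisbon",
]
hidden_state = {"diet": "vegan", "city": "lisbon"}

full = recovery_score(memory, hidden_state)              # probe sees every entry
limited = recovery_score(memory, hidden_state, top_k=1)  # probe sees one entry per query
```

In this toy setup the full store recovers both dimensions, while `top_k=1` lets the distractor crowd out the entry that holds the diet value, so recovery drops even though nothing was removed from memory.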

Evaluation and Benchmarking · Agent and Tool Ecosystem · MemProbe

6 · arXiv · cs.AI · 18d ago

AgentBeats: Standardized Agent Evaluation via A2A and MCP Protocols

A new arXiv preprint proposes Agentified Agent Assessment (AAA), a framework where evaluation is performed by judge agents interacting through standardized protocols—A2A for task management and MCP for tool access—rather than bespoke benchmark harnesses. The authors introduce AgentBeats as a concrete implementation, validated through a five-month open competition with 298 judge agents and 467 subject agents across 12 categories, plus a coding-agent case study. The work addresses fragmentation in agent evaluation by decoupling assessment logic from agent implementation, enabling reproducible and interoperable benchmarking.

Evaluation and Benchmarking · Agent and Tool Ecosystem · AgentBeats: Agentifying Agent Assessment for Openness, Standardization, and Reproducibility · AgentBeats · MCP

6 · arXiv · cs.CL · 22d ago

Agentopia: Long-term multi-agent life simulation framework for training LLMs on social behavior

Researchers introduce Agentopia, a framework for simulating 10 years of social life across 100 LLM-powered agents, enabling the study of emergent social behaviors and long-term personal growth dynamics. The system defines a 'life reward' metric that mirrors human well-being and uses it to train LLMs via rejection sampling. Training on simulated social experience yields a 15.6% improvement on downstream role-playing benchmarks, suggesting that synthetic social simulation can transfer to real capability gains.

Agent and Tool Ecosystem · Alignment and RLHF · Agentopia · Agentopia: Long-Term Life Simulation and Learning in Agent Societies

6 · arXiv · cs.CL · 1mo ago

Agentic CLEAR: Automating Multi-Level Evaluation of LLM Agents

Agentic CLEAR is an automatic evaluation framework for LLM-based agentic systems that analyzes behavior at three granularity levels: system, trace, and node. Unlike existing tools that rely on static error taxonomies or focus only on observability, it dynamically generates textual insights and integrates above the observability layer with an accessible UI. Experiments across four benchmarks and seven agentic settings demonstrate strong alignment with human-annotated errors and predictive accuracy for task success rates.

Evaluation and Benchmarking · AI Safety Research · Agentic CLEAR · multi-level agent evaluation · LLM agents

6 · arXiv · cs.CL · 28d ago

AgentCL: A Rigorous Evaluation Framework for Continual Learning in Language Agents

AgentCL is a new benchmark and evaluation framework designed to rigorously assess continual learning in language agents, addressing gaps in existing benchmarks that focus on retrieval over long-context documents or use naive task streams with limited cross-task analysis. The framework constructs compositional task streams where earlier sub-solutions, evidence, or workflows are intentionally reusable in later tasks, contrasting them with naive streams to measure transfer gains. The authors also introduce MemProbe, a probing method that stores interactions, insights, and skills while filtering unreliable experiences during consolidation. Empirical results across coding, deep research, and language understanding tasks show that controlled streams better distinguish memory design quality, and that naive streams can mask memory-induced degradation.

Long Context Evolution · Evaluation and Benchmarking · AgentCL · MemProbe · Continual Learning

7 · arXiv · cs.CL · 7d ago

Evaluation awareness in LLMs is multidimensional, not a single capability — evidence from 37 open models

A new arXiv paper characterizes 'evaluation awareness' (the ability of models to detect that they are being tested and adapt their behavior accordingly) across 37 open-weight models from 7 families using 8 experiments. Key findings: 24 of 37 models exceed chance at detecting evaluation conditions, hard refusal drops 5.8 percentage points under hypothetical framing, and compliance can rise by up to 30 percentage points on HarmBench under framing shifts. Critically, the three axes of awareness (detection, behavioral manifestation, controllability) are nearly uncorrelated, leading the authors to coin the term 'benchmark illusion': no single awareness score reliably predicts deployment safety.

Evaluation and Benchmarking · AI Safety Research · HarmBench · Evaluation Awareness Is Not One Capability: Evidence from Open Language Models