5 · arXiv cs.AI (Artificial Intelligence) · 16h ago

AgenticSTS: Bounded-memory testbed for studying long-horizon LLM agent decisions in Slay the Spire 2

Researchers introduce AgenticSTS, a testbed for studying long-horizon LLM agents under a bounded-memory contract in which each decision is assembled from typed retrieval rather than from an appended raw cross-decision transcript. The system is instantiated in Slay the Spire 2, a stochastic deck-building game requiring hundreds of sequential decisions, chosen because frontier LLMs currently win zero games at the lowest difficulty against a 16% human baseline. In ablation experiments, enabling a strategic skill layer raises the win rate from 3/10 to 6/10, though the sample is too small for the difference to be statistically significant. The authors release 298 completed trajectories, memory snapshots, and analysis scripts as a reusable methodology for isolating how explicit memory layers affect agent performance.
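
To see why a jump from 3/10 to 6/10 wins does not reach statistical significance, here is a minimal sketch that runs a two-sided Fisher exact test on those counts. The choice of test and the function name `fisher_exact_two_sided` are assumptions made for illustration; the summary does not say which analysis the authors used.

```python
from math import comb


def fisher_exact_two_sided(wins_a: int, n_a: int, wins_b: int, n_b: int) -> float:
    """Two-sided Fisher exact p-value for a 2x2 win/loss table (illustrative helper)."""
    total_wins = wins_a + wins_b
    total_runs = n_a + n_b

    def weight(k: int) -> int:
        # Unnormalised hypergeometric weight: number of ways arm A gets exactly k wins
        # when the total number of wins across both arms is held fixed.
        return comb(n_a, k) * comb(n_b, total_wins - k)

    k_min = max(0, total_wins - n_b)
    k_max = min(n_a, total_wins)
    observed = weight(wins_a)
    # Sum every possible table that is no more likely than the observed one.
    # Integer weights avoid floating-point ties in the comparison.
    extreme = sum(weight(k) for k in range(k_min, k_max + 1) if weight(k) <= observed)
    return extreme / comb(total_runs, total_wins)


# Counts taken from the summary: 3 wins in 10 runs without the skill layer, 6 in 10 with it.
p_value = fisher_exact_two_sided(3, 10, 6, 10)
print(f"two-sided p = {p_value:.3f}")
```

Under these assumptions the p-value comes out at about 0.37, well above the conventional 0.05 threshold, so a doubling of the win rate at ten runs per arm is consistent with chance.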

Evaluation and Benchmarking · Agent and Tool Ecosystem · AgenticSTS · Slay the Spire 2

Related guides (2)

Evaluation and Benchmarking · Topic guide

AI Evaluation and Benchmarking: From Leaderboards to the Limits of Measurement

Agent and Tool Ecosystem · Topic guide

Agent and Tool Ecosystem: How AI Is Learning to Act, Not Just Answer

Related events (8)

7 · arXiv · cs.CL · 34h ago

AutoMem: Automated framework trains LLMs to manage memory as a learnable cognitive skill

AutoMem is a new framework that treats memory management in LLMs as a trainable skill, using two optimization loops: one in which a strong LLM reviews trajectories and iteratively revises the memory structure, and one that distills good memory decisions into a direct training signal for the agent model. Evaluated on three long-horizon procedurally generated games (Crafter, MiniHack, NetHack), optimizing memory alone yielded 2x-4x performance improvements, making a 32B open-weight model competitive with frontier systems like Claude Opus 4.5 and Gemini 3.1 Pro Thinking. The work draws on cognitive science concepts of metamemory and demonstrates that memory management is an independently learnable, high-leverage capability for long-horizon agentic tasks.

Long Context Evolution · Open Weights Progress · Gemini 3.1 Pro · Claude Opus 4.6 · NetHack · +7 more

6 · arXiv · cs.CL · 1mo ago

STT-Arena: Benchmark for Adaptive Replanning Under Spatio-Temporal Dynamics in Tool-Using LLMs

STT-Arena is a new benchmark of 227 interactive tasks designed to evaluate LLMs' ability to detect mid-task disruptions and replan under spatio-temporal dynamics, covering nine conflict types and four solvability levels. Evaluation of frontier models including Claude-4.6-Opus shows less than 40% overall accuracy, revealing fundamental limitations in dynamic reasoning. The authors identify three recurring failure modes—Stale-State Execution, Misdiagnosis of Dynamic Triggers, and Missing Post-Adaptation Verification—and propose an iterative trajectory refinement technique combined with online RL to train STT-Agent-4B, a 4B-parameter model that outperforms frontier LLMs on the benchmark.

Evaluation and Benchmarking · Agent and Tool Ecosystem · Claude Opus 4.6 · iterative trajectory refinement · spatio-temporal dynamic reasoning · +5 more

6 · arXiv · cs.CL · 25d ago

Agentopia: Long-term multi-agent life simulation framework for training LLMs on social behavior

Researchers introduce Agentopia, a framework for simulating 10 years of social life across 100 LLM-powered agents, enabling study of emergent social behaviors and long-term personal growth dynamics. The system defines a 'life reward' metric mirroring human well-being and uses it to train LLMs via rejection sampling. Training on simulated social experience yields a +15.6% improvement on downstream role-playing benchmarks, suggesting that synthetic social simulation can generalize to real capability gains.

Agent and Tool Ecosystem · Alignment and RLHF · Agentopia · Agentopia: Long-Term Life Simulation and Learning in Agent Societies

6 · arXiv · cs.CL · 18d ago

GitOfThoughts: Git-based agent memory substrate with sobering findings on memory utility for novel problems

Researchers introduce GitOfThoughts, a system that stores LLM reasoning trees as git repositories, enabling replayable, auditable, and mergeable agent memory at low engineering cost. Across five memory substrates (none, markdown, vector, graph, git), two benchmarks, and two model scales with pre-registered replications, the paper finds that no memory format reliably improves accuracy on novel problems. Memory helps only above a 'copyability threshold' (similarity >~0.8), where retrieved cases are near-duplicates of the current problem, and even then the gain is answer retrieval rather than method transfer. The paper also documents a retracted result and a refuted hypothesis, modeling a rigorous evaluation standard.

Evaluation and Benchmarking · Agent and Tool Ecosystem · GitOfThoughts: Version-Controlled Reasoning and Agent Memory You Can Replay, Diff, and Merge · GitOfThoughts

6 · arXiv · cs.CL · 1mo ago

LongMINT: Benchmark for Evaluating Memory Under Multi-Target Interference in Long-Horizon Agent Systems

LongMINT is a new benchmark designed to evaluate memory-augmented agents in realistic long-horizon settings where information is repeatedly updated and interferes across memories. It contains 15.6k QA pairs over contexts averaging 138.8k tokens (up to 1.8M tokens), spanning domains including state tracking, multi-turn dialogue, Wikipedia revisions, and GitHub commits. Evaluation of 7 representative systems—including vanilla long-context LLMs, RAG, and memory-augmented agent frameworks—reveals consistently low average accuracy of 27.9%, with performance particularly degraded on multi-target aggregation tasks and when earlier facts are revised by subsequent context. The analysis identifies retrieval and memory construction as the primary bottlenecks.

Long Context Evolution · Evaluation and Benchmarking · LongMINT · Retrieval-Augmented Generation · long-context LLMs · +2 more

5 · arXiv · cs.CL · 24d ago

AGENTSERVESIM: Hardware-aware simulator for multi-turn LLM agent serving policies

Researchers introduce AGENTSERVESIM, a simulation framework designed to evaluate serving policies for multi-turn LLM agents without requiring dedicated accelerator hardware. The simulator models program-level execution including turn dependencies, tool-induced gaps, and KV-cache residency across HBM, host DRAM, and CXL memory hierarchies. It reproduces real-system behavior within 6% error on key performance metrics while running on commodity CPUs, enabling cost-effective exploration of scheduling, routing, and cache management policies for agentic workloads.

Training Infrastructure · Inference Economics · AGENTSERVESIM · +1 more

5 · arXiv · cs.LG · 8d ago

RevengeBench: Benchmark for Reconstructing Agent Decision Programs from Behavioral Observations

RevengeBench is a new benchmark of 75 LLM-generated, Elo-calibrated policies across five game environments that tests whether a learner can reconstruct a hidden agent's decision program as executable code from behavioral traces alone. The benchmark draws from CodeClash tournament trajectories and allows the learner to design controlled behavioral probes (custom opponent policies) to elicit informative behavior before submitting an executable hypothesis. Evaluated across twelve frontier LLMs, recovery quality ranges from 34% to 72% of initial action-distance closed, and the reconstructed policies provide a measurable competitive advantage, especially for weaker models. The work frames policy reconstruction as a tractable inverse problem in code-space, with implications for opponent modeling and policy interpretability.

Evaluation and Benchmarking · Agent and Tool Ecosystem · CodeClash · RevengeBench

6 · arXiv · cs.CL · 18d ago

AgentSpec: A modular framework for controlled composition and analysis of embodied LLM agent scaffolds

AgentSpec is a new modular specification framework that represents embodied LLM agents as typed compositions of reusable policy components with standardized interfaces across perception, memory, reasoning, reflection, action, and learning modules. The framework enables controlled swapping and recombination of components, instantiated across four benchmarks (DeliveryBench, ALFRED, MiniGrid, RoboTHOR). Key findings include that agent performance is governed by scaffold compatibility and interaction effects rather than isolated module strength, and that RL-trained policies compose best when optimized with deployment-time scaffold structure. Code, baselines, and an interactive playground are publicly released.

Evaluation and Benchmarking · Agent and Tool Ecosystem · DeliveryBench · AgentSpec · MiniGrid · +2 more