6arXiv cs.CL (Computation and Language)·5d ago

AgentSpec: A modular framework for controlled composition and analysis of embodied LLM agent scaffolds

AgentSpec is a new modular specification framework that represents embodied LLM agents as typed compositions of reusable policy components with standardized interfaces across perception, memory, reasoning, reflection, action, and learning modules. The framework enables controlled swapping and recombination of components, instantiated across four benchmarks (DeliveryBench, ALFRED, MiniGrid, RoboTHOR). Key findings include that agent performance is governed by scaffold compatibility and interaction effects rather than isolated module strength, and that RL-trained policies compose best when optimized with deployment-time scaffold structure. Code, baselines, and an interactive playground are publicly released.

Evaluation and Benchmarking Agent and Tool Ecosystem DeliveryBench AgentSpec MiniGrid RoboTHOR ALFRED

Related guides (2)

Agent and Tool EcosystemTopic guide

Agent and Tool Ecosystem: How AI Is Learning to Act, Not Just Answer

Read asBeginner In-depth

Evaluation and BenchmarkingTopic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Read asBeginner In-depth

Related events (8)

6arXiv · cs.AI·1mo ago·source ↗

A Methodology for Selecting and Composing Runtime Architecture Patterns for Production LLM Agents

This paper introduces the stochastic-deterministic boundary (SDB) as a foundational architectural primitive for production LLM agent runtimes, defining it as a four-part contract (proposer, verifier, commit step, reject signal) governing how LLM outputs become system actions. The authors organize agent runtime design around Coordination, State, and Control concerns, presenting a catalog of six runtime patterns applicable to conversational, autonomous, and long-horizon agents. A five-step pattern-selection methodology and diagnostic procedure mapping production failures to pattern weaknesses are contributed, along with a newly named failure mode—replay divergence—where LLM consumers of deterministic event logs produce inconsistent outputs across model versions or prompt changes. The paper argues that as model variance decreases, architectural pattern choice and SDB strength become the dominant reliability levers.

Evaluation and Benchmarking Enterprise Deployment Patterns replay divergence human-in-the-loop pattern hierarchical delegation pattern +4 more

6arXiv · cs.CL·12d ago·source ↗

Agentopia: Long-term multi-agent life simulation framework for training LLMs on social behavior

Researchers introduce Agentopia, a framework for simulating 10 years of social life across 100 LLM-powered agents, enabling study of emergent social behaviors and long-term personal growth dynamics. The system defines a 'life reward' metric mirroring human well-being and uses it to train LLMs via rejection sampling. Training on simulated social experience yields a +15.6% improvement on downstream role-playing benchmarks, suggesting that synthetic social simulation can generalize to real capability gains.

Agent and Tool Ecosystem Alignment and RLHF Agentopia Agentopia: Long-Term Life Simulation and Learning in Agent Societies

6arXiv · cs.AI·15d ago·source ↗

Benchmark Agent: Autonomous system for end-to-end benchmark construction

Researchers introduce Benchmark Agent, a fully autonomous agentic system that orchestrates the complete benchmark construction pipeline — from query analysis and subtask design to data annotation and quality control. The system was used to produce 15 benchmarks spanning text understanding, multimodal understanding, and domain-specific reasoning, with evaluation via human judges, LLM-as-a-judge, and consistency checks. The work addresses two persistent problems in the field: the labor intensity of benchmark creation and rapid performance saturation after release. Code and a demo will be publicly released.

Evaluation and Benchmarking Agent and Tool Ecosystem Benchmark Everything Everywhere All at Once Benchmark Agent

4arXiv · cs.AI·16d ago·source ↗

AgentMob: Training-free LLM agent framework for evidence-grounded mobility prediction

AgentMob is a training-free LLM-driven agent framework that formulates next-location prediction as adaptive evidence-controlled decision making, using a fast path for routine cases and iterative tool use for ambiguous ones. Evaluated on three mobility datasets, it achieves the strongest overall performance among training-free LLM-based methods, with GPT-5.4 reaching 71.42% Acc@1 on the BW dataset. The framework demonstrates that LLM controllers add most value in resolving ambiguous predictions through adaptive evidence gathering rather than routine cases.

Agent and Tool Ecosystem YJMob100K AgentMob BW +1 more

4Github Trending·24d ago·source ↗

AgentScope: Open-Source Multi-Agent Framework for Transparent, Trustworthy Agents

AgentScope is an open-source Python framework for building and running AI agents with an emphasis on observability and trustworthiness. The repository has accumulated 25,755 total GitHub stars with 95 new stars today, indicating sustained community interest. It targets developers building multi-agent systems and positions itself around interpretability and reliability of agent behavior.

Agent and Tool Ecosystem AgentScope

6arXiv · cs.CL·8d ago·source ↗

EurekAgent: Environment Engineering as the Key Bottleneck for Autonomous Scientific Discovery

EurekAgent is a new LLM-based agent system that reframes autonomous scientific discovery around 'environment engineering' — designing the resources, constraints, and interfaces that shape agent behavior — rather than prescribing agent workflows. The system engineers four dimensions: permissions, artifact management (filesystem/Git), budget awareness, and human-in-the-loop oversight. It achieves state-of-the-art results on mathematics, kernel engineering, and ML tasks, including new 26-circle packing results at under $11 in API cost, and is fully open-sourced.

Evaluation and Benchmarking Agent and Tool Ecosystem EurekAgent: Agent Environment Engineering is All You Need For Autonomous Scientific Discovery EurekAgent

3Github Trending·12d ago·source ↗

agent-teams-ai: multi-agent orchestration framework with kanban-style oversight

A TypeScript open-source project on GitHub implements a multi-agent system where autonomous agents handle tasks, communicate with each other, and review each other's work, while the user supervises via a kanban board. The framework supports 200+ models across 75+ LLM providers including Codex, Claude, and OpenCode. It has accumulated 1,189 stars with 56 added today, suggesting growing community interest.

Agent and Tool Ecosystem agent-teams-ai OpenAI Anthropic

6arXiv · cs.AI·8d ago·source ↗

AgentBeats: Standardized Agent Evaluation via A2A and MCP Protocols

A new arXiv preprint proposes Agentified Agent Assessment (AAA), a framework where evaluation is performed by judge agents interacting through standardized protocols—A2A for task management and MCP for tool access—rather than bespoke benchmark harnesses. The authors introduce AgentBeats as a concrete implementation, validated through a five-month open competition with 298 judge agents and 467 subject agents across 12 categories, plus a coding-agent case study. The work addresses fragmentation in agent evaluation by decoupling assessment logic from agent implementation, enabling reproducible and interoperable benchmarking.

Evaluation and Benchmarking Agent and Tool Ecosystem AgentBeats: Agentifying Agent Assessment for Openness, Standardization, and Reproducibility AgentBeats MCP +1 more