5 · arXiv cs.CL (Computation and Language) · 11h ago

MacAgentBench: New benchmark for AI agents on real-world macOS desktop tasks

MacAgentBench is a 676-task benchmark spanning 25 macOS applications, designed to evaluate computer-use agents (CUAs). It adds framework augmentation and fine-grained multi-checkpoint scoring to address gaps in existing benchmarks that rely on binary evaluation. Nearly 60% of tasks involve both GUI and CLI interaction, and the authors test 16 models across three agent frameworks. The best result, Claude Opus 4.6 on the OpenClaw framework, achieves 73.7% Pass@1, with performance gains attributed primarily to the skill library rather than framework design. Fine-grained metrics reveal that models with similar Pass@1 scores can differ substantially in sub-goal completion, which highlights the limits of coarse evaluation.
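The gap between a binary score and a multi-checkpoint score can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the `TaskResult` type, the two helper functions, and the toy data are hypothetical and are not MacAgentBench's actual scoring code or results. The sketch treats a task as passed only when every checkpoint is met, and compares that with the mean fraction of checkpoints met.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """One attempt at one task; each entry marks whether a checkpoint (sub-goal) was met."""
    checkpoints: list[bool]

    @property
    def passed(self) -> bool:
        # Binary view (assumed rule): the task counts only if every checkpoint is met.
        return all(self.checkpoints)

    @property
    def checkpoint_rate(self) -> float:
        # Fine-grained view: fraction of checkpoints met on this task.
        return sum(self.checkpoints) / len(self.checkpoints)


def pass_at_1(results: list[TaskResult]) -> float:
    """Share of tasks fully solved on the single attempt recorded per task."""
    return sum(r.passed for r in results) / len(results)


def mean_checkpoint_rate(results: list[TaskResult]) -> float:
    """Average per-task fraction of checkpoints met."""
    return sum(r.checkpoint_rate for r in results) / len(results)


# Hypothetical toy data: two models, two tasks each, four checkpoints per task.
model_a = [
    TaskResult([True, True, True, True]),
    TaskResult([True, True, True, False]),   # fails only the last sub-goal
]
model_b = [
    TaskResult([True, True, True, True]),
    TaskResult([False, False, False, False]),  # makes no progress at all
]

for name, results in [("A", model_a), ("B", model_b)]:
    print(name, pass_at_1(results), mean_checkpoint_rate(results))
```

In this toy case both models have the same Pass@1 (0.5), but their mean checkpoint rates differ (0.875 versus 0.5), which is the kind of difference a single pass/fail number hides.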

Evaluation and Benchmarking · Agent and Tool Ecosystem · Claude Opus 4.6 · OpenClaw · MacAgentBench · Anthropic

Related guides (4)

Agent and Tool Ecosystem · Topic guide

Agent and Tool Ecosystem: How AI Is Learning to Act, Not Just Answer


Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder


Anthropic

Anthropic: Frontier AI Lab at the Intersection of Capability and Safety Governance


Claude Opus 4.6

Claude Opus 4.6: Anthropic's Milestone Model for Long-Context and Agentic Work


Related events (8)

5 · arXiv · cs.CL · 19d ago

RealClawBench: Live benchmark framework built from real developer-agent sessions

RealClawBench is a new benchmark framework that converts real OpenClaw developer-agent sessions into reproducible, automatically scored evaluation tasks. It addresses realism gaps in existing agent benchmarks through reconstructed execution environments and deterministic verifiable scorers, releasing 281 executable tasks sampled to preserve the source session distribution. Evaluation of 14 contemporary models shows the best system solves only 65.8% of tasks, indicating substantial headroom on realistic developer-agent workloads.

Evaluation and Benchmarking · Agent and Tool Ecosystem · OpenClaw · RealClawBench

5 · arXiv · cs.CL · 5h ago

EnterpriseClawBench: A benchmark for enterprise agents derived from real workplace sessions

Researchers introduce EnterpriseClawBench, an enterprise agent benchmark constructed from proprietary real-world workplace sessions, yielding 852 reproducible tasks with fixtures, prompts, role classes, skill subclasses, and semantic rubrics. Because the sessions contain internal enterprise content, the benchmark data is not publicly released, but the construction and evaluation protocol is the reusable contribution. The best evaluated configuration (Codex with GPT-5.5) achieves only 0.663, indicating substantial headroom. The paper argues enterprise agent evaluation must report harness-model combinations, artifact delivery, visual quality, cost, runtime, and skill-transfer behavior rather than collapsing to a single score.

Evaluation and Benchmarking · Enterprise Deployment Patterns · FrontisAI · EnterpriseClawBench · Codex

6 · arXiv · cs.AI · 28d ago

Claw-Anything: Benchmark for Always-On Personal Assistants with Broad Digital World Access

Claw-Anything is a new benchmark designed to evaluate LLM agents acting as always-on personal assistants with access to long-horizon activity histories, interdependent backend services, and multi-device GUI/CLI interaction. The benchmark simulates months of user activity to create complex, noisy world states and evaluates both reactive and proactive assistance. GPT-5.5 achieves only 34.5% pass@1, revealing a substantial capability gap versus prior narrower benchmarks. An accompanying automated data-generation pipeline produces 2,000 training environments and yields a 23.7% improvement over the base model.

Long Context Evolution · Evaluation and Benchmarking · multi-round event injection · Claw-Anything · large language model agents

7 · arXiv · cs.AI · 19d ago

AutoLab benchmark evaluates frontier models on ultra long-horizon iterative research and engineering tasks

AutoLab is a new benchmark of 36 expert-curated tasks across system optimization, puzzle-solving, model development, and CUDA kernel optimization, designed to test agents on sustained closed-loop improvement under wall-clock budgets rather than single-turn or short-horizon settings. Evaluation of 17 frontier models finds that persistence in iterative benchmarking and feedback incorporation — not initial attempt quality — is the dominant success predictor. Claude Opus 4.6 stands out as the strongest performer, while most models including proprietary ones either terminate early or exhaust budgets with minimal progress. The benchmark, harness, and task artifacts are open-sourced.

Frontier Model Releases · Evaluation and Benchmarking · Claude Opus 4.6 · AutoLab · Anthropic

6 · arXiv · cs.AI · 18d ago

Benchmark Agent: Autonomous system for end-to-end benchmark construction

Researchers introduce Benchmark Agent, a fully autonomous agentic system that orchestrates the complete benchmark construction pipeline — from query analysis and subtask design to data annotation and quality control. The system was used to produce 15 benchmarks spanning text understanding, multimodal understanding, and domain-specific reasoning, with evaluation via human judges, LLM-as-a-judge, and consistency checks. The work addresses two persistent problems in the field: the labor intensity of benchmark creation and rapid performance saturation after release. Code and a demo will be publicly released.

Evaluation and Benchmarking · Agent and Tool Ecosystem · Benchmark Everything Everywhere All at Once · Benchmark Agent

4 · arXiv · cs.CL · 13d ago

T1-Bench: Multi-scenario agent benchmark across 25 real-world domains

T1-Bench is a new benchmark for evaluating agentic LLM systems in realistic customer-facing, multi-domain environments, covering 25 domains of varying difficulty with interleaved multi-turn scenarios. The authors evaluate 12 proprietary and open-weight models and combine automatic evaluation with human judgments. The benchmark targets gaps in existing agent evals around task complexity, domain diversity, and compositional reasoning across multi-step interactions.

Evaluation and Benchmarking · Agent and Tool Ecosystem · T1-Bench

6 · arXiv · cs.CL · 14d ago

iOSWorld: Benchmark for Personalized iOS Phone Agents with Persistent User Identity

Researchers introduce iOSWorld, the first interactive native iOS simulator benchmark designed to evaluate phone agents on personalized, identity-aware tasks across 26 custom-built iOS apps. The benchmark includes 133 tasks spanning single-app, multi-app, and memory/personalization categories, with connected personal data such as transactions, messages, and social relationships. Frontier models reach only 52% overall and 37% on multi-app tasks; privileged vision+XML access improves frontier models by up to 26 percentage points but does not help smaller models. The benchmark is released open-source with all apps, data, tasks, and evaluation code.

Evaluation and Benchmarking · Agent and Tool Ecosystem · iOSWorld

6 · arXiv · cs.AI · 11d ago

AgentBeats: Standardized Agent Evaluation via A2A and MCP Protocols

A new arXiv preprint proposes Agentified Agent Assessment (AAA), a framework where evaluation is performed by judge agents interacting through standardized protocols—A2A for task management and MCP for tool access—rather than bespoke benchmark harnesses. The authors introduce AgentBeats as a concrete implementation, validated through a five-month open competition with 298 judge agents and 467 subject agents across 12 categories, plus a coding-agent case study. The work addresses fragmentation in agent evaluation by decoupling assessment logic from agent implementation, enabling reproducible and interoperable benchmarking.

Evaluation and Benchmarking · Agent and Tool Ecosystem · AgentBeats: Agentifying Agent Assessment for Openness, Standardization, and Reproducibility · AgentBeats · MCP