5 · arXiv cs.CL (Computation and Language) · 10d ago

Claw-SWE-Bench: A benchmark for evaluating agent harnesses on multilingual coding tasks

Researchers introduce Claw-SWE-Bench, a multilingual SWE-bench-style benchmark and adapter protocol designed to fairly compare heterogeneous agent harnesses ("claws") on GitHub issue-resolution tasks. The benchmark contains 350 instances across 8 languages and 43 repositories, with an 80-instance Lite subset for cost-efficient validation. The key finding is that adapter design can matter more than raw model choice: with the same GLM-5.1 backbone, a minimal adapter scores 19.1% Pass@1 versus 73.4% for a full adapter. Harness choice and model choice each shift Pass@1 by roughly 27 to 29 percentage points. The work also introduces cost accounting as a first-class evaluation axis alongside accuracy.
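As a rough illustration of the two evaluation axes mentioned above, the sketch below computes Pass@1 and a cost-per-resolved-instance figure from a list of run records, then the gap between the two reported adapter scores. The `RunResult` record, its field names, and the `cost_per_resolved` metric are assumptions made for this example, not the benchmark's actual schema or cost formula, and the toy records are invented purely to exercise the functions.

```python
from dataclasses import dataclass


@dataclass
class RunResult:
    """One single-attempt run on one benchmark instance (hypothetical layout)."""
    instance_id: str
    resolved: bool    # True if the instance counts as resolved after the agent's patch
    cost_usd: float   # assumed per-run spend; the field name is illustrative


def pass_at_1(results: list[RunResult]) -> float:
    """Percentage of instances resolved on a single attempt."""
    if not results:
        raise ValueError("no results to score")
    return 100.0 * sum(r.resolved for r in results) / len(results)


def cost_per_resolved(results: list[RunResult]) -> float:
    """Total spend divided by resolved instances (one possible cost axis)."""
    resolved = sum(r.resolved for r in results)
    total = sum(r.cost_usd for r in results)
    return total / resolved if resolved else float("inf")


# Toy records, invented for illustration only.
toy = [
    RunResult("a", True, 0.25),
    RunResult("b", True, 0.50),
    RunResult("c", False, 0.25),
    RunResult("d", True, 0.50),
]
print(pass_at_1(toy))          # 75.0
print(cost_per_resolved(toy))  # 0.5

# Gap between the two adapter scores reported in the summary (same backbone).
minimal_adapter = 19.1
full_adapter = 73.4
gap_points = round(full_adapter - minimal_adapter, 1)
print(gap_points)              # 54.3
```

On the reported figures, the minimal-to-full adapter gap works out to 54.3 percentage points, which is about twice the 27 to 29 point shift attributed to harness choice or model choice individually.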

Evaluation and Benchmarking Inference Economics Agent and Tool Ecosystem SWE-Bench Multilingual OpenClaw SWE-Bench Verified Claw-SWE-Bench SWE-bench GLM-5.1

Related guides (3)

Agent and Tool Ecosystem · Topic guide

Agent and Tool Ecosystem: How AI Is Learning to Act, Not Just Answer

Inference Economics · Topic guide

Inference Economics: The Cost of Running AI in Production

Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Related events (8)

5 · arXiv · cs.CL · 18d ago

RealClawBench: Live benchmark framework built from real developer-agent sessions

RealClawBench is a new benchmark framework that converts real OpenClaw developer-agent sessions into reproducible, automatically scored evaluation tasks. It addresses realism gaps in existing agent benchmarks through reconstructed execution environments and deterministic verifiable scorers, releasing 281 executable tasks sampled to preserve the source session distribution. Evaluation of 14 contemporary models shows the best system solves only 65.8% of tasks, indicating substantial headroom on realistic developer-agent workloads.

Evaluation and Benchmarking Agent and Tool Ecosystem OpenClaw RealClawBench

6 · arXiv · cs.AI · 26d ago

Claw-Anything: Benchmark for Always-On Personal Assistants with Broad Digital World Access

Claw-Anything is a new benchmark designed to evaluate LLM agents acting as always-on personal assistants with access to long-horizon activity histories, interdependent backend services, and multi-device GUI/CLI interaction. The benchmark simulates months of user activity to create complex, noisy world states and evaluates both reactive and proactive assistance. GPT-5.5 achieves only 34.5% pass@1, revealing a substantial capability gap versus prior narrower benchmarks. An accompanying automated data-generation pipeline produces 2,000 training environments and yields a 23.7% improvement over the base model.

Long Context Evolution Evaluation and Benchmarking multi-round event injection Claw-Anything large language model agents +3 more

5 · arXiv · cs.CL · 13d ago

SWE-Explore: New benchmark isolates repository exploration capability in coding agents

SWE-Explore is a new benchmark targeting repository exploration as a distinct, fine-grained capability of coding agents, separate from end-to-end task resolution. It covers 848 issues across 10 programming languages and 203 open-source repositories, with line-level ground truth derived from successful agent trajectories. Evaluation across retrieval methods, coding agents, and specialized localizers finds that agentic explorers outperform classical retrieval, and that line-level coverage and efficient ranking remain the key differentiators at the frontier. The benchmark addresses a gap in SWE-bench-style evaluations that treat task resolution as a binary outcome.

Evaluation and Benchmarking Agent and Tool Ecosystem SWE-Explore SWE-bench

6 · The Batch · 2d ago

DeepSWE, ProgramBench, and ITBench-AA emerge as harder successors to SWE-bench for agent evaluation

Three new benchmarks, DeepSWE (by Datacurve), ProgramBench (Meta/Stanford/Harvard), and ITBench-AA (IBM/Artificial Analysis), are positioned as more rigorous replacements for the SWE-bench family, which models have largely saturated. DeepSWE tests feature implementation using private codebases and human-written problems; ProgramBench evaluates agents' ability to recreate functional programs from scratch; ITBench-AA measures root-cause diagnosis in real-world IT incident scenarios. Current top performers are GPT-5.5 (70% on DeepSWE) and Claude Opus 4.7 (46.7% on ITBench-AA and 3% on ProgramBench at the 95% pass threshold), illustrating that even frontier models have substantial headroom.

Evaluation and Benchmarking Agent and Tool Ecosystem Artificial Analysis Llama 3.1 70B Datacurve +13 more

6 · arXiv · cs.AI · 16d ago

Benchmark Agent: Autonomous system for end-to-end benchmark construction

Researchers introduce Benchmark Agent, a fully autonomous agentic system that orchestrates the complete benchmark construction pipeline — from query analysis and subtask design to data annotation and quality control. The system was used to produce 15 benchmarks spanning text understanding, multimodal understanding, and domain-specific reasoning, with evaluation via human judges, LLM-as-a-judge, and consistency checks. The work addresses two persistent problems in the field: the labor intensity of benchmark creation and rapid performance saturation after release. Code and a demo will be publicly released.

Evaluation and Benchmarking Agent and Tool Ecosystem Benchmark Everything Everywhere All at Once Benchmark Agent

6 · arXiv · cs.AI · 13d ago

AARRI-Bench evaluates frontier LLMs and agents on granular research-intern-level tasks

Researchers introduce AARR (Act As a Real Researcher), a new benchmark series targeting whether AI agents can emulate the professionalism, thoroughness, and nuanced judgment of human researchers in granular research scenarios—not just macro-level task execution. The first benchmark, AARRI-Bench, tests frontier models and agentic harnesses, finding that even the best configuration (Mini-SWE-Agent with Claude Opus 4.7) achieves only 68.3% success, frequently missing subtle but critical details obvious to human researchers. The work argues that closing the gap requires deeper modeling of research behavior rather than more complex scaffolding.

Evaluation and Benchmarking Agent and Tool Ecosystem Claude Opus 4.6 SWE-bench AARRI-Bench +2 more

7 · arXiv · cs.CL · 26d ago

Automated Benchmark Auditing for AI Agents and Large Language Models (ABA)

The paper introduces Auto Benchmark Audit (ABA), an agentic framework that systematically audits AI benchmark tasks for issues such as ambiguous specifications, environment conflicts, and incorrect ground truths. Applied to 168 benchmarks across nine domains including NeurIPS publications, ABA identifies critical issues in over 25.7% of evaluated tasks. The authors demonstrate that filtering out flawed tasks materially shifts model rankings and improves average performance on SWE-bench Verified and Terminal-Bench 2 by 9.9% and 9.6% respectively, indicating that current benchmark scores are significantly distorted by task quality problems. The agentic tool and annotations are released publicly.

Frontier Model Releases Evaluation and Benchmarking NeurIPS Auto Benchmark Audit (ABA) SWE-Bench Verified +2 more

6 · arXiv · cs.LG · 26d ago

From Model Scaling to System Scaling: Scaling the Harness in Agentic AI

This paper argues that the next major bottleneck in agentic AI is system-level design—what the authors call 'scaling the harness'—rather than continued model scaling alone. The agent harness encompasses memory substrates, context constructors, skill-routing layers, orchestration loops, and verification/governance components that together translate model capability into long-horizon behavior. The authors identify three core bottlenecks (context governance, trustworthy memory, dynamic skill routing) and propose harness-level benchmarks measuring trajectory quality, memory hygiene, and verification cost. They introduce CheetahClaws, a Python-native reference harness, and compare it against Claude Code and OpenClaw.

Evaluation and Benchmarking Inference Economics SafeRL-Lab dynamic skill routing Scaling the Harness (paper) +8 more