5arXiv cs.CL (Computation and Language)·21h ago

TestEvo-Bench: Live executable benchmark for test and code co-evolution tasks

Researchers introduce TestEvo-Bench, a benchmark of 1,255 tasks (746 test generation, 509 test update) mined from 152 open-source Java projects and designed to evaluate whether AI agents can correctly propagate code changes into test suites. Each task is anchored to a real commit and packaged with an execution environment, so results can be scored by pass rate, coverage, and mutation score. The benchmark is 'live': new tasks are periodically mined and timestamped, which allows evaluation to be restricted to data from after a model's training cutoff and reduces leakage risk. Experiments with Claude Code, Gemini CLI, and SWE-Agent paired with Claude Opus 4.7 and Gemini 3.1 Pro show up to 77.5% success on test generation, but performance drops notably on the most recent tasks and under cost constraints.

Evaluation and Benchmarking Agent and Tool Ecosystem Gemini 3.1 Pro Gemini CLI Claude Opus 4.6 Google SWE-Agent Claude Code Anthropic TestEvo-Bench
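
The "live" evaluation idea in the summary above can be sketched in a few lines. This is only an illustration, not the benchmark's actual tooling: the `Task` record, its field names, and the helper functions are hypothetical, and the sample data is made up. It shows how timestamped tasks could be restricted to those mined after a model's training cutoff, and how pass rate and mutation score (killed mutants divided by total mutants) are conventionally computed.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class Task:
    """Hypothetical task record; field names are assumptions, not the paper's schema."""
    task_id: str
    kind: str          # "generation" or "update"
    commit_date: date  # date of the real commit the task is anchored to
    passed: bool       # whether the agent's tests passed when executed


def post_cutoff(tasks: List[Task], cutoff: date) -> List[Task]:
    """Keep only tasks whose anchoring commit is strictly after the training cutoff."""
    return [t for t in tasks if t.commit_date > cutoff]


def pass_rate(tasks: List[Task]) -> Optional[float]:
    """Fraction of tasks that passed, or None when there are no tasks."""
    if not tasks:
        return None
    return sum(t.passed for t in tasks) / len(tasks)


def mutation_score(killed: int, total: int) -> Optional[float]:
    """Share of injected mutants that the test suite detects (kills)."""
    if total == 0:
        return None
    return killed / total


if __name__ == "__main__":
    # Made-up sample data for illustration only.
    tasks = [
        Task("a", "generation", date(2025, 1, 10), True),
        Task("b", "update", date(2026, 2, 1), False),
        Task("c", "generation", date(2026, 3, 5), True),
    ]
    recent = post_cutoff(tasks, cutoff=date(2025, 12, 31))
    print("all tasks:", pass_rate(tasks))
    print("post-cutoff only:", pass_rate(recent))
```

Comparing the two printed rates is the kind of check the summary describes: a gap between the full set and the post-cutoff subset is one signal that older tasks may have leaked into training data.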

Related guides (4)

Anthropic

Anthropic: The AI Safety Company at the Center of the Frontier

Google

Google: The Full-Stack AI Contender from Research to Consumer

Claude Opus 4.6

Claude Opus 4.6: Anthropic's Leap into Million-Token, Agentic AI

Claude Code

Claude Code: Anthropic's Autonomous Coding Agent

Related events (8)

5arXiv · cs.CL·Jun 3, 2026·source ↗

RealClawBench: Live benchmark framework built from real developer-agent sessions

RealClawBench is a new benchmark framework that converts real OpenClaw developer-agent sessions into reproducible, automatically scored evaluation tasks. It addresses realism gaps in existing agent benchmarks through reconstructed execution environments and deterministic verifiable scorers, releasing 281 executable tasks sampled to preserve the source session distribution. Evaluation of 14 contemporary models shows the best system solves only 65.8% of tasks, indicating substantial headroom on realistic developer-agent workloads.

Evaluation and Benchmarking Agent and Tool Ecosystem OpenClaw RealClawBench

5arXiv · cs.CL·Jun 23, 2026·source ↗

EnterpriseClawBench: A benchmark for enterprise agents derived from real workplace sessions

Researchers introduce EnterpriseClawBench, an enterprise agent benchmark constructed from proprietary real-world workplace sessions, yielding 852 reproducible tasks with fixtures, prompts, role classes, skill subclasses, and semantic rubrics. Because the sessions contain internal enterprise content, the benchmark data is not publicly released, but the construction and evaluation protocol is the reusable contribution. The best evaluated configuration (Codex with GPT-5.5) achieves only 0.663, indicating substantial headroom. The paper argues enterprise agent evaluation must report harness-model combinations, artifact delivery, visual quality, cost, runtime, and skill-transfer behavior rather than collapsing to a single score.

Evaluation and Benchmarking Enterprise Deployment Patterns FrontisAI EnterpriseClawBench Codex +2 more

6The Batch·Jun 19, 2026·source ↗

DeepSWE, ProgramBench, and ITBench-AA emerge as harder successors to SWE-bench for agent evaluation

Three new benchmarks, DeepSWE (by Datacurve), ProgramBench (Meta/Stanford/Harvard), and ITBench-AA (IBM/Artificial Analysis), are positioned as more rigorous replacements for the SWE-bench family, which models have largely saturated. DeepSWE tests feature implementation using private codebases and human-written problems; ProgramBench evaluates agents' ability to recreate functional programs from scratch; ITBench-AA measures root-cause diagnosis in real-world IT incident scenarios. Current top performers include GPT-5.5 (70% on DeepSWE) and Claude Opus 4.7 (46.7% on ITBench-AA and 3% on ProgramBench at the 95% pass threshold), illustrating that even frontier models have substantial headroom.

Evaluation and Benchmarking Agent and Tool Ecosystem Artificial Analysis Llama 3.1 70B Datacurve +13 more

5Hugging Face Blog·May 19, 2026·source ↗

BigCodeBench: The Next Generation of HumanEval

Hugging Face introduces BigCodeBench, a new code generation benchmark designed to succeed HumanEval by offering more challenging and diverse programming tasks. The benchmark aims to better evaluate LLMs on real-world coding scenarios involving complex function calls and library usage. A leaderboard accompanies the release to track model performance across the community.

Evaluation and Benchmarking Agent and Tool Ecosystem BigCodeBench Hugging Face HumanEval

5arXiv · cs.CL·Jun 23, 2026·source ↗

MacAgentBench: New benchmark for AI agents on real-world macOS desktop tasks

MacAgentBench introduces a 676-task benchmark across 25 macOS applications designed to evaluate computer use agents (CUAs) with framework augmentation and fine-grained multi-checkpoint scoring, addressing gaps in existing binary-evaluation benchmarks. Nearly 60% of tasks involve both GUI and CLI interaction, and the benchmark tests 16 models across three agent frameworks. The best result — Claude Opus 4.6 on the OpenClaw framework — achieves 73.7% Pass@1, with performance gains attributed primarily to the skill library rather than framework design. Fine-grained metrics reveal that models with similar Pass@1 scores can differ substantially in sub-goal completion, highlighting limitations of coarse evaluation.

Evaluation and Benchmarking Agent and Tool Ecosystem Claude Opus 4.6 OpenClaw MacAgentBench +1 more

5OpenAI Blog·May 20, 2026·source ↗

Introducing EVMbench: AI Agent Benchmark for Smart Contract Vulnerabilities

OpenAI and Paradigm have jointly introduced EVMbench, a benchmark designed to evaluate AI agents on their ability to detect, patch, and exploit high-severity vulnerabilities in Ethereum Virtual Machine (EVM) smart contracts. The benchmark targets a specialized security domain requiring both code understanding and adversarial reasoning. This represents a new evaluation surface for frontier AI agents in the context of blockchain security.

Evaluation and Benchmarking Agent and Tool Ecosystem Ethereum Virtual Machine OpenAI Paradigm +1 more

7arXiv · cs.LG·3d ago·source ↗

SWE-Interact benchmark evaluates coding agents on multi-turn, user-driven software engineering tasks

SWE-Interact is a new benchmark testbed that evaluates coding agents in realistic multi-turn developer workflows, where a user simulator starts with vague instructions and progressively reveals requirements. Unlike existing SWE benchmarks that provide complete specs upfront, SWE-Interact tests interactive goal discovery and iterative refinement. Frontier models including Claude Opus 4.8 and GPT-5.5 solve ~50% of single-turn baseline tasks but only ~25% of SWE-Interact tasks, revealing a significant capability gap. The benchmark is grounded in large-scale studies of real coding-agent interactions and identifies failure modes like over-agentic coding, requirement forgetting, and early abandonment under ambiguity.

Evaluation and Benchmarking Agent and Tool Ecosystem SWE-Interact SWE-Bench Verified OpenAI +3 more

5arXiv · cs.CL·Jun 8, 2026·source ↗

SWE-Explore: New benchmark isolates repository exploration capability in coding agents

SWE-Explore is a new benchmark targeting repository exploration as a distinct, fine-grained capability of coding agents, separate from end-to-end task resolution. It covers 848 issues across 10 programming languages and 203 open-source repositories, with line-level ground truth derived from successful agent trajectories. Evaluation across retrieval methods, coding agents, and specialized localizers finds that agentic explorers outperform classical retrieval, and that line-level coverage and efficient ranking remain the key differentiators at the frontier. The benchmark addresses a gap in SWE-bench-style evaluations that treat task resolution as a binary outcome.

Evaluation and Benchmarking Agent and Tool Ecosystem SWE-Explore SWE-bench
