Entity · benchmark

TestEvo-Bench

benchmark · active · provisional · testevo-bench-b7c04384 · 1 event · first seen 20h ago

Aliases: TestEvo-Bench

Co-occurring entities

Gemini 3.1 Pro · Gemini CLI · Claude Opus 4.6 · Google · SWE-Agent · Claude Code · Anthropic

More like this (12)

EvoBench · TriggerBench · T3Bench · ProgramBench · DevDataBench · EVA-Bench Data 2.0 · LiveBench · EQ-Bench · IVEBench · SorryBench · T1-Bench · MemBench

Recent events (1)

5 · arXiv · cs.CL · 20h ago

TestEvo-Bench: Live executable benchmark for test and code co-evolution tasks

Researchers introduce TestEvo-Bench, a benchmark of 1,255 tasks (746 test generation, 509 test update) mined from 152 open-source Java projects, designed to evaluate whether AI agents can correctly propagate code changes into test suites. Each task is anchored to a real commit and packaged with execution environments, so results can be scored by pass rate, coverage, and mutation score. The benchmark is 'live': new tasks are periodically mined and timestamped, so evaluation can be restricted to tasks created after a model's training cutoff, which reduces leakage risk. Experiments with Claude Code, Gemini CLI, and SWE-Agent paired with Claude Opus 4.7 and Gemini 3.1 Pro show up to 77.5% success on test generation, but performance drops notably on the most recent tasks and under cost constraints.
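
The cutoff-restricted evaluation described above can be illustrated with a minimal Python sketch. This is not the benchmark's actual tooling: the `Task` record, its field names, the sample data, and the cutoff date are all hypothetical, chosen only to show how timestamped tasks might be filtered before computing a success rate.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Task:
    # Hypothetical record layout; the benchmark's real schema may differ.
    task_id: str
    kind: str          # "generation" or "update"
    commit_date: date  # date of the commit the task is anchored to
    passed: bool       # whether the agent's tests passed when executed


def post_cutoff(tasks, cutoff):
    """Keep only tasks whose anchoring commit is strictly after the cutoff."""
    return [t for t in tasks if t.commit_date > cutoff]


def success_rate(tasks):
    """Fraction of tasks that passed, or None if there are no tasks."""
    if not tasks:
        return None
    return sum(t.passed for t in tasks) / len(tasks)


# Made-up illustrative data, not results from the paper.
tasks = [
    Task("t1", "generation", date(2025, 1, 10), True),
    Task("t2", "generation", date(2025, 6, 2), False),
    Task("t3", "update", date(2025, 7, 15), True),
    Task("t4", "update", date(2024, 11, 30), True),
]

cutoff = date(2025, 3, 1)  # assumed training cutoff for some model
recent = post_cutoff(tasks, cutoff)
print(success_rate(tasks))   # 0.75
print(success_rate(recent))  # 0.5
```

Comparing the two rates on the same model is one simple way to surface the kind of drop on recent tasks that the summary reports; a real evaluation would also track coverage and mutation score per task.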

Evaluation and Benchmarking · Agent and Tool Ecosystem · Gemini 3.1 Pro · Gemini CLI · Claude Opus 4.6