Entity · product

Auto Benchmark Audit (ABA)

product · active · auto-benchmark-audit-aba--e65500cd · 1 event · first seen May 26, 2026

Aliases: Auto Benchmark Audit (ABA)

Co-occurring entities

NeurIPS · SWE-Bench Verified · Terminal-Bench

More like this (12)

AutomationBench-AA · alignment auditing · AssetOpsBench · MemoryAgentBench · APS-Bench · temporally grounded QA benchmark · OpAI-Bench · TriggerBench · AI Reproducibility Benchmark · ITBench-AA · DBA-Bench · VerifierBench

Recent events (1)

7 · arXiv · cs.CL · May 26, 2026

Automated Benchmark Auditing for AI Agents and Large Language Models (ABA)

The paper introduces Auto Benchmark Audit (ABA), an agentic framework that systematically audits AI benchmark tasks for issues such as ambiguous specifications, environment conflicts, and incorrect ground truths. Applied to 168 benchmarks across nine domains including NeurIPS publications, ABA identifies critical issues in over 25.7% of evaluated tasks. The authors demonstrate that filtering out flawed tasks materially shifts model rankings and improves average performance on SWE-bench Verified and Terminal-Bench 2 by 9.9% and 9.6% respectively, indicating that current benchmark scores are significantly distorted by task quality problems. The agentic tool and annotations are released publicly.
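The filtering effect described above can be illustrated with a toy calculation. The sketch below is not the ABA tool or its data: the task ids, model names, pass/fail outcomes, and the set of flagged tasks are all hypothetical, chosen only to show how dropping audited-as-flawed tasks can raise pass rates and reorder models.

```python
# Toy illustration only: all task ids, model names, and outcomes are made up.
# Each task maps model name -> whether that model passed the task.
results = {
    "k1": {"model_a": True,  "model_b": True},
    "k2": {"model_a": True,  "model_b": True},
    "k3": {"model_a": True,  "model_b": False},
    "f1": {"model_a": False, "model_b": True},
    "f2": {"model_a": False, "model_b": True},
    "f3": {"model_a": False, "model_b": False},
    "f4": {"model_a": False, "model_b": False},
}

# Hypothetical audit output: tasks flagged as flawed (for example, an
# ambiguous specification or an incorrect ground truth).
flagged = {"f1", "f2", "f3", "f4"}


def pass_rates(results, excluded=frozenset()):
    """Return each model's pass rate over the tasks not in `excluded`."""
    kept = [outcomes for task, outcomes in results.items() if task not in excluded]
    if not kept:
        raise ValueError("no tasks left after filtering")
    models = kept[0].keys()
    return {m: sum(1 for o in kept if o[m]) / len(kept) for m in models}


def ranking(rates):
    """Order models from best to worst pass rate; ties break by name."""
    return sorted(rates, key=lambda m: (-rates[m], m))


before = pass_rates(results)
after = pass_rates(results, excluded=flagged)

print("before:", before, ranking(before))
print("after: ", after, ranking(after))
```

In this made-up data both pass rates rise once the flagged tasks are removed, and the two models swap places, which is the kind of shift the summary attributes to task quality problems.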

Frontier Model Releases · Evaluation and Benchmarking · NeurIPS · Auto Benchmark Audit (ABA) · SWE-Bench Verified · +2 more