product
Auto Benchmark Audit (ABA)
productactiveprovisional
auto-benchmark-audit-aba--e65500cd·1 events·first seen 22d agoAliases: Auto Benchmark Audit (ABA)
Co-occurring entities
More like this (12)
Recent events (1)
Automated Benchmark Auditing for AI Agents and Large Language Models (ABA)
The paper introduces Auto Benchmark Audit (ABA), an agentic framework that systematically audits AI benchmark tasks for issues such as ambiguous specifications, environment conflicts, and incorrect ground truths. Applied to 168 benchmarks across nine domains including NeurIPS publications, ABA identifies critical issues in over 25.7% of evaluated tasks. The authors demonstrate that filtering out flawed tasks materially shifts model rankings and improves average performance on SWE-bench Verified and Terminal-Bench 2 by 9.9% and 9.6% respectively, indicating that current benchmark scores are significantly distorted by task quality problems. The agentic tool and annotations are released publicly.