Auto Benchmark Audit (ABA)

product · active · provisional · auto-benchmark-audit-aba--e65500cd · 1 event · first seen 22d ago

Aliases: Auto Benchmark Audit (ABA)

Recent events (1)

7 · arXiv · cs.CL · 22d ago

Automated Benchmark Auditing for AI Agents and Large Language Models (ABA)

The paper introduces Auto Benchmark Audit (ABA), an agentic framework that systematically audits AI benchmark tasks for issues such as ambiguous specifications, environment conflicts, and incorrect ground truths. Applied to 168 benchmarks across nine domains, including NeurIPS publications, ABA identifies critical issues in over 25.7% of evaluated tasks. The authors show that filtering out flawed tasks materially shifts model rankings and improves average performance on SWE-bench Verified and Terminal-Bench 2 by 9.9% and 9.6%, respectively. They take this as evidence that current benchmark scores are significantly distorted by task quality problems. The tool and its annotations are publicly released.
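To make the filtering effect concrete, here is a minimal Python sketch of how dropping flagged tasks can change both average scores and model ordering. It is not ABA's implementation: the function names (`mean_pass_rate`, `rank_models`), the model names, the task IDs, and every pass/fail value are hypothetical toy data, not results from the paper.

```python
from typing import AbstractSet, Dict, List

# Hypothetical per-task outcomes: model -> {task_id: passed}.
# All values are invented for illustration only.
RESULTS: Dict[str, Dict[str, bool]] = {
    "model_a": {"t1": True, "t2": True, "t3": True, "t4": True,
                "t5": False, "t6": False, "t7": False},
    "model_b": {"t1": True, "t2": True, "t3": True, "t4": False,
                "t5": True, "t6": True, "t7": False},
}

# Hypothetical set of tasks an audit flagged as flawed.
FLAGGED = frozenset({"t5", "t6", "t7"})


def mean_pass_rate(results: Dict[str, bool],
                   excluded: AbstractSet[str] = frozenset()) -> float:
    """Fraction of passed tasks, ignoring any task in `excluded`."""
    kept = [passed for task, passed in results.items() if task not in excluded]
    if not kept:
        raise ValueError("no tasks left after filtering")
    return sum(kept) / len(kept)


def rank_models(all_results: Dict[str, Dict[str, bool]],
                excluded: AbstractSet[str] = frozenset()) -> List[str]:
    """Model names sorted by descending pass rate (ties broken by name)."""
    scores = {m: mean_pass_rate(r, excluded) for m, r in all_results.items()}
    return sorted(scores, key=lambda m: (-scores[m], m))


if __name__ == "__main__":
    for label, excluded in (("all tasks", frozenset()), ("filtered", FLAGGED)):
        ranking = rank_models(RESULTS, excluded)
        rates = {m: round(mean_pass_rate(RESULTS[m], excluded), 3) for m in ranking}
        print(label, ranking, rates)
```

In this toy setup both models score higher once the flagged tasks are removed, and their order swaps, which is the kind of shift the summary describes.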