swe-perf-8c6d6eaf · 1 events · first seen · Aliases: SWE-Perf
A new arXiv paper audits three prominent repository-level code-optimization benchmarks (GSO, SWE-Perf, SWE-fficiency) used to rank coding agents and finds significant reliability problems in all three. In cross-machine replays, reference patches satisfy the validity rules for only 39 of 102 GSO tasks (about 38%) and 11 of 140 SWE-Perf tasks (about 8%), and leaderboard rankings disagree on 9 of 28 pairwise comparisons (about 32%) depending on which scoring rule is used. The authors also find that at least one public submission already matches or beats the reference patch on 85.3% of replay-valid tasks, suggesting that aggregate leaderboard scores obscure the true frontier. The study raises substantive concerns about whether these benchmarks provide a reliable signal for claims of progress in coding-agent capability.
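To make the pairwise-disagreement figure concrete, here is a minimal sketch of how one might count the pairs of submissions that two scoring rules order differently. It is not the paper's method: the function name `discordant_pairs`, the agent names, and all scores are hypothetical and chosen only for illustration. With n submissions there are n(n-1)/2 pairs to compare.

```python
from itertools import combinations

def discordant_pairs(scores_a, scores_b):
    """Count pairs of submissions that two scoring rules rank in opposite order.

    scores_a and scores_b map submission name -> score under each rule
    (higher is better). Ties under either rule are not counted as disagreements.
    Returns (number of discordant pairs, total number of pairs).
    """
    names = sorted(scores_a)
    total = 0
    discordant = 0
    for x, y in combinations(names, 2):
        total += 1
        diff_a = scores_a[x] - scores_a[y]
        diff_b = scores_b[x] - scores_b[y]
        # Opposite signs mean the two rules disagree on who is ahead.
        if diff_a * diff_b < 0:
            discordant += 1
    return discordant, total

# Hypothetical scores, NOT taken from the paper or any leaderboard.
rule_a = {"agent_1": 0.9, "agent_2": 0.7, "agent_3": 0.5, "agent_4": 0.3}
rule_b = {"agent_1": 0.6, "agent_2": 0.8, "agent_3": 0.4, "agent_4": 0.5}

print(discordant_pairs(rule_a, rule_b))  # (2, 6)
```

In this toy data the two rules flip the order of agent_1 versus agent_2 and of agent_3 versus agent_4, so 2 of the 6 pairs are discordant. A figure such as "9 of 28" can be read the same way: the share of head-to-head comparisons whose outcome depends on the scoring rule.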