Entity · benchmark

SWE-Perf

benchmark · active · provisional · swe-perf-8c6d6eaf · 1 event · first seen 40h ago

Aliases: SWE-Perf

Co-occurring entities

Google Cloud · Are Performance-Optimization Benchmarks Reliably Measuring Coding Agents? · SWE-fficiency · GSO

More like this (12)

SWE-Pro · SWE-fficiency · SWE-Explore · SWE-Interact · SWE-bench · SWE-Agent · SWE-Smith · SWE-Bench Lite · SWE-Bench Verified · DeepSWE · FrontierSWE · SWE-Marathon

Recent events (1)

7 · arXiv · cs.AI · 40h ago

Audit finds GSO, SWE-Perf, and SWE-fficiency benchmarks unreliable for measuring coding agent progress

A new arXiv paper audits three prominent repository-level code-optimization benchmarks (GSO, SWE-Perf, SWE-fficiency) used to rank coding agents, finding significant reliability problems across all three. Reference patches satisfy validity rules in cross-machine replays for only 39/102 GSO tasks and 11/140 SWE-Perf tasks, and leaderboard rankings disagree on 9 of 28 pairwise comparisons depending on scoring rule choice. The authors also find that at least one public submission already matches or beats the reference patch on 85.3% of replay-valid tasks, suggesting aggregate leaderboard scores obscure the true frontier. The study raises substantive concerns about whether these benchmarks are providing reliable signal for claims of coding-agent capability progress.
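To put the quoted counts in proportion, the following minimal Python sketch converts the replay-valid task counts from the summary into rates (roughly 38% for GSO and 8% for SWE-Perf) and shows one way a "pairwise ranking disagreement" between two scoring rules could be counted. The function names `valid_rate` and `pairwise_disagreements`, the agent names, and the scores in `rule_a` and `rule_b` are hypothetical and for illustration only; they are not taken from the paper, whose actual scoring rules are not described here.

```python
from itertools import combinations

# Replay-valid counts quoted in the summary: (valid tasks, total tasks).
replay_valid = {"GSO": (39, 102), "SWE-Perf": (11, 140)}


def valid_rate(valid, total):
    """Fraction of tasks whose reference patch passed the validity rules on replay."""
    return valid / total


def pairwise_disagreements(scores_a, scores_b):
    """Count pairs of systems that two scoring rules rank in opposite order."""
    count = 0
    for x, y in combinations(sorted(scores_a), 2):
        diff_a = scores_a[x] - scores_a[y]
        diff_b = scores_b[x] - scores_b[y]
        if diff_a * diff_b < 0:  # opposite signs: the two rules disagree on this pair
            count += 1
    return count


# Hypothetical scores under two scoring rules (NOT data from the paper).
rule_a = {"agent1": 0.40, "agent2": 0.35, "agent3": 0.30}
rule_b = {"agent1": 0.38, "agent2": 0.41, "agent3": 0.29}

for name, (valid, total) in replay_valid.items():
    print(f"{name}: {valid}/{total} = {valid_rate(valid, total):.1%}")

total_pairs = len(list(combinations(rule_a, 2)))
print(pairwise_disagreements(rule_a, rule_b), "of", total_pairs, "pairs disagree")
```

With the hypothetical scores above, only the agent1/agent2 pair flips between the two rules; the audit reports the analogous count as 9 of 28 pairs on the real leaderboards.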

Evaluation and Benchmarking · Agent and Tool Ecosystem · Google Cloud · Are Performance-Optimization Benchmarks Reliably Measuring Coding Agents? · SWE-fficiency