SWE-Lancer
benchmark · active
swe-lancer-7a438b8c · 1 event · first seen 28d ago
Aliases: SWE-Lancer
Recent events (1)
Introducing the SWE-Lancer benchmark
OpenAI has released SWE-Lancer, a benchmark that evaluates frontier LLMs on real-world freelance software engineering tasks sourced from Upwork, with combined payouts of $1 million. The benchmark tests whether models can complete tasks that human freelancers were paid to do, so evaluation is grounded in economic value rather than in synthetic metrics. This positions SWE-Lancer as a practically oriented complement to existing code benchmarks such as SWE-bench.