Almanac
← Events
7OpenAI Blog·1mo ago

Introducing the SWE-Lancer benchmark

OpenAI has released SWE-Lancer, a new benchmark that evaluates frontier LLMs on real-world freelance software engineering tasks sourced from Upwork, with a total payout value of $1 million. The benchmark tests whether models can complete tasks that human freelancers were paid to do, grounding evaluation in economic value rather than synthetic metrics. This positions SWE-Lancer as a practically-oriented complement to existing code benchmarks like SWE-bench.

Related guides (4)

Related events (8)

6Openai Blog·1mo ago·source ↗

Introducing SWE-bench Verified

OpenAI is releasing SWE-bench Verified, a human-validated subset of the SWE-bench benchmark designed to more reliably evaluate AI models on real-world software engineering tasks. The original SWE-bench contained issues that were ambiguous or unsolvable, leading to unreliable scores; the Verified subset addresses this by having human annotators confirm task solvability and clarity. This provides a cleaner signal for comparing coding agent performance across labs.

7Openai Blog·1mo ago·source ↗

OpenAI Abandons SWE-bench Verified Over Contamination and Measurement Flaws

OpenAI has announced it will no longer evaluate models on SWE-bench Verified, citing benchmark contamination and flawed test cases that cause it to mismeasure frontier coding capabilities. Their analysis identified both problematic test design and training data leakage as sources of unreliability. OpenAI recommends SWE-bench Pro as a replacement benchmark for evaluating coding progress.

6arXiv · cs.AI·12d ago·source ↗

AARRI-Bench evaluates frontier LLMs and agents on granular research-intern-level tasks

Researchers introduce AARR (Act As a Real Researcher), a new benchmark series targeting whether AI agents can emulate the professionalism, thoroughness, and nuanced judgment of human researchers in granular research scenarios—not just macro-level task execution. The first benchmark, AARRI-Bench, tests frontier models and agentic harnesses, finding that even the best configuration (Mini-SWE-Agent with Claude Opus 4.7) achieves only 68.3% success, frequently missing subtle but critical details obvious to human researchers. The work argues that closing the gap requires deeper modeling of research behavior rather than more complex scaffolding.

6The Batch·35h ago·source ↗

DeepSWE, ProgramBench, and ITBench-AA emerge as harder successors to SWE-bench for agent evaluation

Three new benchmarks — DeepSWE (by Datacurve), ProgramBench (Meta/Stanford/Harvard), and ITBench-AA (IBM/Artificial Analysis) — are positioned as more rigorous replacements for the SWE-bench family, which models have largely saturated. DeepSWE tests feature implementation using private codebases and human-written problems; ProgramBench evaluates agents' ability to recreate functional programs from scratch; ITBench-AA measures root-cause diagnosis in real-world IT incident scenarios. Current top performers include GPT-5.5 (70% on DeepSWE), Claude Opus 4.7 (46.7% on ITBench-AA), and Claude Opus 4.7 (3% on ProgramBench at the 95% pass threshold), illustrating that even frontier models have substantial headroom.

6Openai Blog·1mo ago·source ↗

MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering

OpenAI introduces MLE-bench, a benchmark designed to measure AI agent performance on machine learning engineering tasks. The benchmark draws from Kaggle competitions to evaluate agents on realistic ML engineering workflows. Initial results show that current agents, including those powered by o1-preview, achieve competitive performance on a subset of tasks but fall well short of top human competitors. The benchmark is intended to track progress in agentic ML capabilities over time.

5arXiv · cs.CL·9d ago·source ↗

Claw-SWE-Bench: A benchmark for evaluating agent harnesses on multilingual coding tasks

Researchers introduce Claw-SWE-Bench, a multilingual SWE-bench-style benchmark and adapter protocol designed to fairly compare heterogeneous agent harnesses ("claws") on GitHub issue-resolution tasks. The benchmark contains 350 instances across 8 languages and 43 repositories, with an 80-instance Lite subset for cost-efficient validation. Key findings show adapter design dominates raw model choice: a minimal adapter scores 19.1% Pass@1 versus 73.4% for a full adapter using the same GLM 5.1 backbone, and harness choice and model choice each shift Pass@1 by roughly 27-29 percentage points. The work also introduces cost accounting as a first-class evaluation axis alongside accuracy.

5arXiv · cs.CL·10d ago·source ↗

NCRE-based benchmark reveals frontier LLMs top out at 68.8% on professional Office automation tasks

Researchers introduce an evaluation suite derived from China's National Computer Rank Examination (NCRE), comprising 200 practical tasks across Word, Excel, and PowerPoint scored via 7,118 machine-gradable criteria. Seven frontier LLMs are benchmarked: single-turn models peak at 36.6% Score Rate, while a full agentic system with execution feedback and iterative repair reaches 68.8%, still well below the 95.5% community-reference score. The results demonstrate that fine-grained, long-horizon Office document automation remains a significant unsolved challenge for current LLM and agent systems despite strong code-generation capabilities.

4Hugging Face Blog·1mo ago·source ↗

Introducing the Open FinLLM Leaderboard

Hugging Face has launched the Open FinLLM Leaderboard, a benchmarking platform specifically designed to evaluate large language models on financial domain tasks. The leaderboard aims to provide standardized, open evaluation of LLMs across finance-specific capabilities such as financial reasoning, document understanding, and numerical analysis. This fills a gap in domain-specific evaluation infrastructure for the financial sector.