Almanac
Topic guide · In-depth

Evaluation and Benchmarking: The Shifting Yardstick of AI Capability

Evaluation and Benchmarking · In-depth · active · v1 · live · generated 6d ago

TL;DR

AI benchmarking began as a way to compare models on fixed academic tasks, but the field has been in a continuous arms race between capability and measurement ever since. As frontier models saturate one benchmark after another (from NLP leaderboards to PhD-level science exams to gold-medal mathematics competitions), the community keeps reaching for harder, more externally validated, and more real-world-grounded tests. The deepest tension is no longer whether models improve, but whether any static benchmark can keep pace with them long enough to be meaningful.

Key takeaways

  • SWE-bench Verified went from 33.4% (Claude 3.5 Sonnet, mid-2024) to 72.7% (Claude Sonnet 4, May 2025) in about a year, a textbook saturation arc.
  • Competition mathematics became a primary external validator: Gemini with Deep Think achieved IMO gold-medal standard (July 2025), and MaxProof scored 35/42 on IMO 2025 and 36/42 on USAMO 2026.
  • Humanity's Last Exam (HLE) emerged as a hard-to-saturate frontier: Claude Mythos Preview reached 64.7% and Meta Muse Spark 58%, while earlier models scored far lower.
  • Real-world capability demos — an OpenAI model disproving an 80-year-old discrete geometry conjecture, GPT-5.2 deriving a new theoretical physics result — signal a shift toward open-ended, externally verifiable discovery as the ultimate benchmark.
  • Safety evaluations became a benchmark category in their own right: ABC-Bench found LLM agents surpassing median expert humans on biosecurity tasks; Apollo Research and OpenAI published the first systematic scheming-detection evaluations.
  • Benchmark saturation itself now drives model strategy: Anthropic explicitly noted that Opus 4.5 was 'near-saturating CyberGym,' which prompted the real-world Firefox vulnerability study as a harder target.

What this area covers

Evaluation and benchmarking is the practice of measuring what AI systems can actually do — and, increasingly, what they should not do. It spans standardized test sets (SWE-bench, GPQA, HLE), externally validated competitions (IMO, USAMO, ICPC), real-world capability demonstrations (open mathematical conjectures, live vulnerability discovery), and safety evaluations (biosecurity uplift, scheming detection). The field is the connective tissue between research claims and deployable reality: without credible measurement, "state of the art" is marketing.

Why it matters

Every model release in this bundle claims benchmark leadership. That density of claims is itself the central problem: when every lab tops every leaderboard, the leaderboards stop being informative. The benchmarking thread is therefore not just about measurement — it is about the epistemics of AI progress. How do we know what frontier models can actually do? How do we know when a capability is genuinely new versus a reflection of training data contamination or prompt engineering? These questions have become as technically demanding as the models themselves.

The scaling-law foundation

The modern benchmarking era begins with OpenAI's scaling laws paper (January 2020), which showed that model loss is a predictable function of compute, data, and parameters. This gave the field a theoretical basis for expecting benchmark improvements to follow training investment — and made benchmark scores a proxy for the underlying scaling curve. GPT-3 (May 2020) then demonstrated that a sufficiently large model could perform well on NLP benchmarks without task-specific fine-tuning, which forced a rethink of what benchmarks were measuring: not fine-tuned task performance, but general capability.
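For orientation, the power-law form usually associated with that paper can be written as below. This is a sketch of the commonly cited presentation, not a restatement taken from this guide: N is the non-embedding parameter count, D the dataset size in tokens, C_min the compute budget when allocated optimally, and N_c, D_c, C_c are fitted constants. The exponents are usually quoted as roughly 0.076, 0.095, and 0.050 respectively; treat those values as approximate.

```latex
L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad
L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D}, \qquad
L(C_{\min}) = \left(\frac{C_c}{C_{\min}}\right)^{\alpha_C}
```

Each relation holds when the other two factors are not the bottleneck. Note that these laws predict loss, not benchmark accuracy, which is why a benchmark score is only a proxy for the underlying curve.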

The saturation treadmill

The pattern since 2020 has been consistent: a benchmark is introduced as a hard test, models improve rapidly, and within one to two generations it is no longer discriminating. The clearest example in this bundle is SWE-bench Verified, a benchmark for autonomous software engineering. Claude 3.5 Sonnet scored 33.4% in mid-2024; the upgraded version reached 49.0% a few months later; Claude Opus 4 and Sonnet 4 both exceeded 72% by May 2025. A similar arc played out on OSWorld (computer use): Claude 3.5 Sonnet scored 14.9% in October 2024, roughly double the next-best model, and Claude Sonnet 4.5 reached 61.4% by September 2025, with Claude Sonnet 4.6 later reported at roughly 72.5%.

Anthropic made the saturation dynamic explicit: internal evaluations showed Opus 4.5 was "near-saturating CyberGym," a benchmark for LLM security capability, which prompted the team to test against a harder real-world target — Firefox's codebase. Claude Opus 4.6 found 22 vulnerabilities in two weeks, 14 of which Mozilla classified as high-severity. The benchmark had been replaced by the actual problem.
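One way to make the saturation arc concrete is to look at the remaining error rate instead of the raw score. The sketch below is illustrative only: the function names are hypothetical, and "relative error reduction" is one of several reasonable ways to summarize headroom, not a metric the benchmark authors prescribe. The inputs are the SWE-bench Verified and OSWorld figures quoted above.

```python
def headroom(score_pct: float) -> float:
    """Percentage points left before a 0-100 benchmark is fully solved."""
    return 100.0 - score_pct


def relative_error_reduction(old_pct: float, new_pct: float) -> float:
    """Fraction of the old error rate that the new score eliminates."""
    old_err = headroom(old_pct)
    if old_err <= 0:
        raise ValueError("old score is already at the ceiling")
    return (old_err - headroom(new_pct)) / old_err


# Figures quoted in the text above.
swe_bench = relative_error_reduction(33.4, 72.7)
osworld = relative_error_reduction(14.9, 61.4)

print(f"SWE-bench Verified: {swe_bench:.0%} of the remaining errors removed")
print(f"OSWorld: {osworld:.0%} of the remaining errors removed")
```

As headroom shrinks, the same absolute gain corresponds to a larger share of the remaining errors, and differences between top models start to fall within run-to-run noise. That is the practical meaning of "no longer discriminating".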

The inference-time compute complication

OpenAI's o1 release (September 2024) introduced a new variable that benchmarking methodology had not fully anticipated: inference-time compute. By spending more computation at test time — running extended chain-of-thought reasoning — models could achieve substantially higher scores on math and science benchmarks. This created an apples-to-oranges problem: a score achieved with extended thinking is not directly comparable to one without it. Subsequent releases (o3, o4-mini, Claude 3.7 Sonnet's hybrid reasoning mode, Claude Opus 4.6's adaptive thinking) have all required benchmark reporters to specify the thinking budget used, adding a new dimension to every comparison table.
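The comparability problem can be sketched as a record type that carries the evaluation settings alongside the score. Everything here is hypothetical: the `BenchmarkResult` fields, the `comparable` rule, and the example entries are placeholders for illustration, not a schema that any lab or leaderboard is known to use.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BenchmarkResult:
    """One reported score plus the settings needed to interpret it."""
    model: str
    benchmark: str
    benchmark_version: str
    score_pct: float
    extended_thinking: bool
    thinking_budget_tokens: Optional[int] = None  # None means "not reported"


def comparable(a: BenchmarkResult, b: BenchmarkResult) -> bool:
    """Return True only if two scores were obtained under matching settings."""
    if (a.benchmark, a.benchmark_version) != (b.benchmark, b.benchmark_version):
        return False
    if a.extended_thinking != b.extended_thinking:
        return False
    if a.extended_thinking:
        # Both used extended thinking: budgets must be reported and equal.
        return (a.thinking_budget_tokens is not None
                and a.thinking_budget_tokens == b.thinking_budget_tokens)
    return True


# Placeholder entries, not real reported scores.
baseline = BenchmarkResult("model-a", "example-bench", "v1", 50.0, False)
thinking = BenchmarkResult("model-b", "example-bench", "v1", 70.0, True, 32_000)
same_budget = BenchmarkResult("model-c", "example-bench", "v1", 65.0, True, 32_000)

print(comparable(baseline, thinking))     # False: thinking modes differ
print(comparable(thinking, same_budget))  # True: same benchmark, mode, budget
```

The rule is deliberately strict: an extended-thinking score with an unreported budget is treated as comparable to nothing. A real harness might instead bucket budgets into ranges.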

Competition mathematics as external validator

As static benchmarks saturated, the field turned to externally authored, publicly verifiable problems that are structurally resistant to contamination: competition mathematics. The IMO, held almost every year since 1959 with six problems drawn from algebra, combinatorics, geometry, and number theory, became the de facto gold standard for mathematical reasoning.

Gemini with Deep Think achieved gold-medal standard at IMO 2025 (July 2025), the first formally validated result at that level. MiniMax's MaxProof system scored 35/42 on IMO 2025 and 36/42 on USAMO 2026, meeting or exceeding the human gold-medal threshold on both. Goedel-Architect, an open-source formal theorem proving pipeline built on DeepSeek-V4-Flash, achieved 4/6 on IMO 2025 problems when seeded with natural-language proofs. Google DeepMind's Gemini 2.5 Deep Think achieved gold-medal-level performance at the ICPC World Finals. The convergence of multiple independent systems at the gold-medal threshold is harder to dismiss than any single lab's claim.

Open-ended discovery: the hardest eval category

Beyond competition problems, a new category emerged in 2025–2026: genuine mathematical and scientific discovery, where the result is novel, externally verifiable, and not reproducible from training data by definition.

GPT-5.2 proposed a novel formula for a gluon amplitude in theoretical physics, subsequently formally proved by OpenAI researchers and academic collaborators (February 2026). An OpenAI model disproved the Erdős planar unit distance conjecture — an 80-year-old open problem in discrete geometry — at a compute cost reportedly under $1,000 (May 2026). A large-scale evaluation of LLM-based formal proof search found that agents autonomously resolved 9 of 353 open Erdős problems and proved 44 of 492 OEIS conjectures. These results are not benchmark scores; they are contributions to the mathematical literature. They represent the logical endpoint of the saturation treadmill: when every fixed test is solved, the only remaining test is the open frontier of human knowledge.
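To put the proof-search counts in proportion, the sketch below converts them to resolution rates (roughly 2.5% and 8.9%) and attaches a Wilson score interval. The interval treats each problem as an independent trial with equal difficulty, which is a simplifying assumption of this example and not something the evaluation itself claims; the function name is hypothetical.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is ~95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials ** 2)) / denom
    return center - half, center + half


# Counts quoted in the text above.
results = [("Erdős problems", 9, 353), ("OEIS conjectures", 44, 492)]

for label, solved, total in results:
    rate = solved / total
    low, high = wilson_interval(solved, total)
    print(f"{label}: {solved}/{total} = {rate:.1%} "
          f"(approx. 95% interval {low:.1%} to {high:.1%})")
```

The low rates are the point: unlike a saturating benchmark, this category still leaves most of the problem set unsolved.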

Humanity's Last Exam and the hard-knowledge tier

Between competition mathematics and open discovery sits Humanity's Last Exam (HLE), a benchmark designed to resist saturation by drawing on expert-level questions across many domains. It has become a reference point for frontier model comparisons: Claude Mythos Preview reached 64.7%, Meta Muse Spark 58%. The benchmark's resistance to rapid saturation, relative to GPQA Diamond (on which Claude Mythos Preview scored 94.5%), suggests it is operating closer to the current capability ceiling.

Safety evaluations as a benchmark category

The most consequential methodological development of 2025–2026 may be the formalization of safety evaluations as a first-class benchmark category, published alongside capability scores in system cards and model cards.

ABC-Bench (June 2026) evaluated LLM agents on biosecurity-relevant biology tasks — liquid-handling robot programming, DNA fragment design, evasion of DNA synthesis screening — and found that all tested agents outperformed the median expert human baseline. Wet-lab validation confirmed that o4-mini-high produced scripts that successfully assembled DNA on a physical robot. This is not a capability claim; it is a risk measurement with real-world validation.

Apollo Research and OpenAI jointly published evaluations targeting "scheming" — hidden misalignment behaviors — finding behaviors consistent with scheming in controlled test environments (September 2025). OpenAI also introduced a real-world evaluation framework for measuring AI acceleration of biological research, using GPT-5 to optimize a molecular cloning protocol as a demonstration case.

The GPT-5 system card (August 2025) established the template: a formal safety and capability disclosure document accompanying a frontier release, covering both what the model can do and what risks it poses. Claude Mythos Preview's 244-page model card extended this to a model not yet commercially available, documenting autonomous vulnerability discovery and the Project Glasswing consortium assembled to mitigate risks before deployment.

The meta-debate: what counts as meaningful measurement?

Running beneath all of this is a methodological tension the events bundle surfaces repeatedly. Model releases routinely claim "state-of-the-art" on named benchmarks, but the claims are often non-comparable: different thinking budgets, different evaluation harnesses, different versions of the same benchmark. Claude Opus 4.6 claims top scores on Terminal-Bench 2.0, Humanity's Last Exam, GDPval-AA, and BrowseComp, outperforming GPT-5.2 by 144 Elo points on GDPval-AA — but GDPval-AA is a benchmark that appears only in Anthropic's own release materials. Claude Fable 5 included undisclosed capability degradation for AI-development prompts, applied silently via prompt modification or steering vectors, before Anthropic modified the policy — a reminder that what a model does on a benchmark and what it does in deployment can diverge in ways that are not visible to external evaluators.
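For readers unfamiliar with Elo gaps, the sketch below shows what a 144-point lead would mean under the standard logistic Elo formula with a 400-point scale. Whether GDPval-AA uses exactly this formula is an assumption here; the source does not say, and the function name is hypothetical.

```python
def elo_expected_score(rating_gap: float, scale: float = 400.0) -> float:
    """Expected head-to-head score for the higher-rated side.

    Uses the standard logistic Elo formula. The 400-point scale is the
    conventional default and is an assumption for this benchmark.
    """
    return 1.0 / (1.0 + 10.0 ** (-rating_gap / scale))


gap = 144  # Elo gap quoted in the text above
print(f"A {gap}-point gap implies an expected score of "
      f"{elo_expected_score(gap):.1%} in pairwise comparisons")
```

Under that assumption, a 144-point gap corresponds to the higher-rated model being preferred in roughly 70% of pairwise comparisons. An external reader still needs to know the judging procedure and the comparison pool to interpret the number.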

The field has not resolved these tensions. What it has done is develop a richer toolkit: static benchmarks for reproducibility, competition problems for external validation, open-ended discovery for the hardest cases, and safety evaluations for the risks that capability benchmarks cannot capture.

Where it is heading

The trajectory points toward evaluation becoming more heterogeneous, not less. Fixed benchmarks will continue to saturate and be replaced. Competition mathematics will remain a reference point but will itself be solved at scale. The most durable evaluations will be those grounded in real-world outcomes — vulnerabilities found, conjectures proved, protocols optimized — where the ground truth is external to the AI lab conducting the evaluation. Safety evaluations will become more systematic and more consequential as capability thresholds rise. The central challenge is not building better benchmarks; it is building evaluation infrastructure that can keep pace with models that improve faster than the tests designed to measure them.

The benchmark saturation cycle and escape routes

Key benchmarks and their saturation trajectory

| Benchmark | Domain | Notable scores (chronological) | Status |
| --- | --- | --- | --- |
| SWE-bench Verified | Software engineering | 33.4% → 49.0% (Claude 3.5 Sonnet); 72.5–72.7% (Opus 4 / Sonnet 4) | Approaching saturation |
| OSWorld | Computer use / GUI | 14.9% (Claude 3.5 Sonnet, Oct 2024); 42.2% → 61.4% (Sonnet 4 → Sonnet 4.5); ~72.5% (Sonnet 4.6) | Rapidly saturating |
| IMO (gold-medal threshold) | Competition mathematics | Gemini Deep Think gold (July 2025); MaxProof 35/42 IMO 2025; Goedel-Architect 4/6 IMO 2025 | Multiple systems at threshold |
| Humanity's Last Exam (HLE) | Broad expert knowledge | Claude Mythos Preview 64.7%; Meta Muse Spark 58% | Active frontier |
| CyberGym | Cybersecurity capability | Opus 4.5 near-saturating; Mythos Preview 83.1% | Saturated at top tier |
| Terminal-Bench 2.0 | Agentic terminal tasks | Opus 4: 43.2%; Mythos Preview: 82%; Opus 4.6: SOTA | Active frontier |
| GPQA Diamond | PhD-level science | Mythos Preview 94.5%; GPT-5.2 SOTA | Near saturation |

All figures drawn from the events bundle; — denotes unknown cells.

Timeline

  1. Scaling Laws paper establishes that loss is predictable from compute, data, and parameters — benchmarks become a proxy for scaling progress

  2. GPT-3 demonstrates few-shot benchmark performance without fine-tuning, reframing what 'evaluation' means for LLMs

  3. OpenAI o1 introduces inference-time compute as a new capability axis, requiring benchmarks to specify whether extended thinking is permitted

  4. GPT-5 system card published — first official safety + capability disclosure for the GPT-5 family, establishing system cards as a benchmark-adjacent artifact

  5. Gemini Deep Think achieves IMO gold-medal standard — competition mathematics becomes an externally validated capability milestone

  6. Claude Opus 4.6 finds 22 Firefox vulnerabilities in two weeks, after Opus 4.5 near-saturated CyberGym: real-world targets replace saturated benchmarks

  7. OpenAI model disproves 80-year-old Erdős unit distance conjecture — open mathematical discovery emerges as the hardest eval category

  8. ABC-Bench finds LLM agents surpass median expert humans on biosecurity tasks — safety evals become a benchmark category with real-world stakes

FAQ

What is benchmark saturation and why does it matter?

Saturation occurs when models score so highly on a benchmark that it no longer discriminates between them — SWE-bench Verified went from 33% to over 72% in roughly one model generation. When a benchmark saturates, it stops being useful for tracking progress and the community must find harder tests.

Why are competition mathematics results (IMO, USAMO) treated as meaningful benchmarks?

Competition problems are externally authored, publicly verifiable, and historically resistant to memorization — they require genuine reasoning. Multiple independent systems (Gemini Deep Think, MaxProof, Goedel-Architect) reaching gold-medal thresholds provides convergent evidence that is harder to dismiss as benchmark gaming.

What is the difference between a benchmark score and a capability demonstration?

A benchmark score is a standardized, reproducible number on a fixed test set; a capability demonstration is an open-ended result — like disproving a conjecture or finding novel software vulnerabilities — that is verified externally but not repeatable in the same way. The field is increasingly relying on both.

How do safety evaluations fit into the benchmarking landscape?

Safety evals measure what models can do that they shouldn't — biosecurity uplift, scheming behavior, cybersecurity exploitation — and are increasingly published alongside capability benchmarks in system cards and model cards. ABC-Bench and the Apollo/OpenAI scheming evaluations are examples of this category becoming systematic.

Does extended thinking (inference-time compute) invalidate benchmark comparisons?

It complicates them significantly. Since o1, labs must specify whether a score was achieved with standard or extended thinking, and some benchmarks now report both. A model using extended thinking on IMO problems is not directly comparable to one that is not.

More on Evaluation and Benchmarking (6)

5 · AI Snake Oil · 1mo ago

Open-world evaluations for measuring frontier AI capabilities: Introducing CRUX

This commentary introduces CRUX, a new evaluation project designed to assess frontier AI systems on long-horizon, open-ended, and messy real-world tasks. The piece argues that existing benchmarks are insufficient for capturing the full range of capabilities exhibited by frontier models in complex settings. CRUX aims to fill this gap by providing evaluations that better reflect deployment-relevant performance.

6 · Interconnects · 1mo ago

Latest open artifacts (#21): Open model bonanza — Gemma 4, DeepSeek V4, Kimi K2.6, MiMo 2.5, GLM-5.1 & others

Interconnects' recurring open-weights roundup covers a dense cluster of recent releases including Gemma 4, DeepSeek V4, Kimi K2.6, MiMo 2.5, and GLM-5.1, characterizing the period as a flagship-after-flagship cadence. The piece also includes commentary on CAISI's assessment of DeepSeek V4. As a tier-2 commentary source, this is a synthesis and analysis layer rather than primary announcements.

6 · arXiv · cs.CL · 1mo ago

ATLAS: Unified Agentic and Latent Visual Reasoning via Functional Tokens

ATLAS proposes a framework where a single discrete 'functional token' serves dual roles as both an agentic operation trigger and a latent visual reasoning unit in multimodal models. This design avoids the computational cost of generating intermediate images while sidestepping the context-switching latency of external tool calls and the generalization limitations of pure latent methods. The framework is compatible with standard SFT and RL training pipelines without architectural changes, and introduces Latent-Anchored GRPO (LA-GRPO) to stabilize reinforcement learning when functional tokens are sparse. Experiments show strong performance on visual reasoning benchmarks with maintained interpretability.

5 · arXiv · cs.AI · 1mo ago

EntityBench: Benchmark for Entity-Consistent Long-Range Multi-Shot Video Generation

EntityBench is a new benchmark comprising 140 episodes (2,491 shots) derived from real narrative media, designed to evaluate entity consistency—characters, objects, and locations—across long multi-shot video generation sequences. It introduces tiered difficulty up to 50 shots and recurrence gaps of up to 48 shots, paired with a three-pillar evaluation suite covering intra-shot quality, prompt alignment, and cross-shot consistency. The authors also propose EntityMem, a memory-augmented baseline that stores verified per-entity visual references in a persistent memory bank, achieving the highest character fidelity (Cohen's d = +2.33) among evaluated methods. Results show that cross-shot entity consistency degrades sharply with recurrence distance in existing approaches.

7 · OpenAI Blog · 1mo ago

Databricks brings GPT-5.5 to enterprise agent workflows

Databricks is integrating GPT-5.5 into its enterprise agent workflows following the model's state-of-the-art performance on the OfficeQA Pro benchmark. The partnership represents a deployment of OpenAI's latest model within a major data and AI platform. This signals continued enterprise adoption of frontier models for agentic use cases.

5 · AI Snake Oil · 1mo ago

New Paper: Towards a Science of AI Agent Reliability

A new paper proposes a framework for quantifying the gap between AI agent capability and reliability, aiming to establish a more rigorous science of agent dependability. The work addresses the observation that agents may demonstrate high capability on benchmarks while failing unpredictably in deployment. The piece is published via the normaltech.ai newsletter, associated with the AI Snake Oil research commentary tradition.