Almanac
Topic guide · In-depth

Frontier Model Releases: The Race from GPT-3 to Safety-Tiered Superintelligence

TL;DR: Frontier model releases have evolved from a scaling race, measured in parameters and benchmark points, into a multi-dimensional contest over agentic capability, safety architecture, and geopolitical control. What began with GPT-3's few-shot learning and ChatGPT's mass adoption has accelerated into a period where labs simultaneously push mathematical reasoning to gold-medal level, deploy models as autonomous cyber agents, and fight governments over what their models are allowed to do.

Key takeaways

  • GPT-3's 175B-parameter few-shot paper (May 2020) and ChatGPT's November 2022 launch are the two founding inflection points that defined the modern frontier.
  • OpenAI's o1 (Sep 2024) introduced inference-time compute scaling via chain-of-thought RL — a new capability axis distinct from training-time scaling.
  • GPT-5 (Aug 2025) and Claude Opus 4 / Sonnet 4 (Sep 2025) mark the generation where SWE-bench Verified scores crossed 70%, making autonomous software engineering a practical claim.
  • Gemini with Deep Think achieved externally validated gold-medal standard at IMO 2025 (Oct 2025); MaxProof (MiniMax) scored 35/42 on IMO 2025 and 36/42 on USAMO 2026 via population-level test-time scaling.
  • OpenAI released gpt-oss-120b and gpt-oss-20b under Apache 2.0 (Aug 2025), marking its first significant entry into open weights.
  • Anthropic's Claude Fable 5 / Mythos 5 (Jun 2026) introduced safety-tiered deployment with per-domain capability degradation — and triggered a US government export-control suspension citing a jailbreak, the first such action against a frontier lab.

What this thread covers

Frontier model releases are the headline-grade checkpoints — new models, major version bumps, and architectural pivots — from the labs operating at the edge of what large-scale AI can do. This thread traces that lineage from the GPT-3 paper through the safety-tiered, government-contested deployments of mid-2026, covering OpenAI, Anthropic, Google DeepMind, Meta, Mistral AI, and MiniMax.

---

Phase 1 — The Scaling Thesis (2020–2022)

The modern frontier begins with a paper: OpenAI's GPT-3 (May 2020), a 175-billion-parameter autoregressive model that demonstrated strong few-shot performance across NLP tasks without gradient updates. The core claim — that scaling model size dramatically improves task-agnostic capability — became the organizing thesis of the field.

The thesis went mass-market in November 2022 with ChatGPT, a conversational wrapper on top of that lineage that could answer follow-up questions, acknowledge errors, and decline inappropriate requests. The dialogue format made the technology legible to non-specialists and triggered a wave of public and enterprise adoption that reshaped the competitive landscape.

---

Phase 2 — Multimodality and Open Weights (2023–mid-2024)

The next phase expanded the modality surface and opened the weights. Mistral AI's Mixtral 8x7B (December 2023) introduced a sparse mixture-of-experts architecture — 46.7B total parameters, 12.9B active per token — that matched or exceeded GPT-3.5 on most benchmarks under an Apache 2.0 license, demonstrating that open-weight models could reach near-frontier quality at a fraction of the inference cost.
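The gap between total and active parameters comes from routing: each token is sent to only a few experts, so most expert weights sit idle for that token. The sketch below illustrates generic top-k gating with toy scalar "experts". The 8-expert, top-2 configuration follows Mistral's published description of Mixtral rather than anything stated in this guide, and the function names, toy experts, and gate scores are all hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a short list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_route(gate_logits, k=2):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    top = sorted(range(len(gate_logits)), key=lambda i: gate_logits[i], reverse=True)[:k]
    weights = softmax([gate_logits[i] for i in top])  # renormalise over the chosen experts only
    return list(zip(top, weights))

def moe_forward(x, experts, gate_logits, k=2):
    """Weighted sum of the outputs of the k selected experts; the rest are never called."""
    return sum(w * experts[i](x) for i, w in top_k_route(gate_logits, k))

# Toy stand-ins (assumption): expert i just multiplies its input by (i + 1).
experts = [lambda x, s=i + 1: s * x for i in range(8)]
gate_logits = [0.1, 2.0, -1.0, 0.3, 1.5, 0.0, -0.5, 0.2]  # made-up router scores for one token

routes = top_k_route(gate_logits)           # experts 1 and 4 are selected
y = moe_forward(3.0, experts, gate_logits)  # only two of the eight experts run

# Share of parameters touched per token, from the figures quoted above.
active_fraction = 12.9 / 46.7
print(routes, y, round(active_fraction, 3))
```

The active share works out to a little over a quarter rather than exactly 2/8, which is consistent with the non-expert layers (such as attention) being used for every token.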

OpenAI's GPT-4o (May 2024) moved in the opposite direction: a natively omnimodal architecture processing audio, vision, and text in a unified model without separate pipeline stages, positioned as the primary production model going forward. Meta's Llama 3.1 (July 2024) pushed open weights to 405B parameters — the largest open-weight release to that point — with multilingual support and extended context.

---

Phase 3 — Inference-Time Scaling and the Reasoning Turn (late 2024)

OpenAI's o1 (September 2024) introduced a new capability axis: inference-time compute scaling via chain-of-thought reasoning trained with reinforcement learning. Rather than simply making models larger, o1 spent more compute at inference to reason through problems step by step. OpenAI reported that o1 ranked in the 89th percentile on competitive programming questions and performed at PhD level on science benchmarks. This reframed the scaling conversation: training-time compute was no longer the only lever.

Earlier that year, Anthropic's Claude 3 family (March 2024; Haiku, Sonnet, and Opus) had established the tiered-model pattern, with multiple capability levels under a single brand. Opus claimed top benchmark positions and offered a 200K context window with near-perfect recall.

---

Phase 4 — Agentic Coding and the SWE-Bench Era (mid–late 2025)

By mid-2025, the benchmark that mattered most for practitioner credibility was SWE-bench Verified — the fraction of real GitHub issues an AI can resolve autonomously. The upgraded Claude 3.5 Sonnet (July 2025) moved that number from 33.4% to 49.0%, surpassing all publicly available models including reasoning models at the time. Computer use — Claude controlling a desktop by viewing screens, moving cursors, and typing — launched in public beta simultaneously.

GPT-5 (August 2025) arrived with a unified routing architecture (gpt-5-main, gpt-5-thinking, and lightweight variants like gpt-5-thinking-nano) disclosed in its system card, claiming state-of-the-art across coding, mathematics, writing, and visual perception. Alongside it, OpenAI made its first significant open-weight move: gpt-oss-120b and gpt-oss-20b under Apache 2.0, optimized for consumer hardware deployment.

Claude Opus 4 and Sonnet 4 (September 2025) pushed SWE-bench to 72.5% and 72.7% respectively, with hybrid near-instant and extended thinking, parallel tool execution, and Claude Code going generally available with GitHub Actions and IDE integrations. Claude 3.7 Sonnet, released the same day, was positioned as the first hybrid reasoning model — a single model operating in both standard and extended thinking modes.

Claude Sonnet 4.5 (November 2025) extended the agentic lead with a 61.4% score on OSWorld (up from 42.2% for Sonnet 4), a native VS Code extension, and a Claude Agent SDK giving developers access to the same infrastructure powering Claude Code.

Google DeepMind's Gemini 3 (November 2025) and Gemini 3.5 (May 2026) marked the Gemini line's pivot toward agentic capabilities and complex workflow execution.

---

Phase 5 — Novel Science and Mathematical Reasoning (late 2025–mid-2026)

A qualitatively new capability claim emerged in this period: frontier models producing novel, verifiable scientific results rather than assisting with known work.

  • Gemini with Deep Think achieved externally validated gold-medal standard at IMO 2025 (October 2025) — six problems across algebra, combinatorics, geometry, and number theory, judged by the same standard as human competitors.
  • GPT-5.2 (December 2025) proposed a novel formula for a gluon amplitude in theoretical physics, subsequently formally proved by OpenAI researchers and academic collaborators.
  • An unnamed OpenAI model disproved an 80-year-old conjecture in discrete geometry — the unit distance problem — announced via the OpenAI blog (May 2026).
  • MiniMax's MaxProof (June 2026) scored 35/42 on IMO 2025 and 36/42 on USAMO 2026 using population-level test-time scaling over a tournament of candidate proofs generated by their MiniMax-M3 model.

These results collectively shift the conversation from "AI assists researchers" to "AI produces research."
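The "population-level test-time scaling over a tournament of candidate proofs" attributed to MaxProof can be pictured as: generate many candidates, then spend further inference compute comparing them until one remains. This guide does not describe MaxProof's actual procedure, so the following is only a generic single-elimination sketch. The candidate generator, the `quality` field, and the judge are hypothetical placeholders for a model sampler and a proof grader.

```python
import random

def run_tournament(candidates, judge, rng):
    """Single elimination: pair candidates up, keep each pair's winner, repeat."""
    pool = list(candidates)
    rng.shuffle(pool)
    while len(pool) > 1:
        next_round = []
        for i in range(0, len(pool) - 1, 2):
            next_round.append(judge(pool[i], pool[i + 1]))
        if len(pool) % 2 == 1:
            next_round.append(pool[-1])  # odd candidate out gets a bye
        pool = next_round
    return pool[0]

def generate_candidates(n, rng):
    """Hypothetical sampler: a 'proof' is a dict with a hidden quality score."""
    return [{"id": i, "quality": rng.random()} for i in range(n)]

def judge(a, b):
    """Hypothetical pairwise grader: prefers the higher-quality candidate."""
    return a if a["quality"] >= b["quality"] else b

rng = random.Random(0)
candidates = generate_candidates(16, rng)
winner = run_tournament(candidates, judge, rng)
print(winner["id"])
```

With a perfect judge the best candidate always survives, so the final answer improves as the population grows. A real grader is noisy, which is one reason such systems may repeat comparisons or use other bracket designs.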

---

Phase 6 — Safety Tiers, Geopolitics, and the Governance Frontier (2026)

The most recent releases have introduced a new dimension: not just what models can do, but what they are permitted to do — and by whom.

Claude Opus 4.5 (March 2026) and Claude Opus 4.6 (March 2026) continued the capability march. Opus 4.6 shipped with a 1M-token context window in beta, adaptive thinking with developer-controlled effort levels, and a claimed 144-Elo lead over GPT-5.2 on GDPval-AA. Pricing held at $5/$25 per million input/output tokens.
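Two numbers in that paragraph are easier to read with a little arithmetic. The sketch below assumes GDPval-AA ratings follow the standard Elo logistic with a 400-point scale (this guide does not say how the ratings are computed). It also reads the $5/$25 figures as input and output prices per million tokens. The function names and the example request size are illustrative only.

```python
def elo_expected_score(rating_gap, scale=400.0):
    """Expected head-to-head score for the higher-rated side under standard Elo."""
    return 1.0 / (1.0 + 10.0 ** (-rating_gap / scale))

def request_cost_usd(input_tokens, output_tokens, input_price=5.0, output_price=25.0):
    """Cost of one request, with prices quoted in USD per million tokens."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000

win_rate = elo_expected_score(144)       # roughly 0.70 under the assumed Elo formula
cost = request_cost_usd(200_000, 10_000) # 200K tokens in, 10K tokens out
print(round(win_rate, 3), cost)
```

Under these assumptions, a 144-point gap corresponds to the higher-rated model being preferred in roughly 70% of pairwise comparisons, and a 200K-token prompt with a 10K-token reply costs $1.25.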

Claude Fable 5 and Mythos 5 (June 2026) represent the most operationally complex frontier release to date. Mythos 5 is the restricted-access model, capable of cracking previously secure software. Fable 5 is the general-use version, with novel safety classifiers that block or degrade responses on cybersecurity, biology, chemistry, and AI-development topics. At launch, some of that degradation was undisclosed, applied silently via prompt modification or steering vectors; this sparked controversy, and Anthropic subsequently modified the policy. Both models set new state-of-the-art results across software engineering, agentic coding, knowledge work, and scientific reasoning, at roughly half the cost of the prior Claude Mythos Preview.

Within 24 hours of the Fable 5 / Mythos 5 launch, the US government issued an export-control directive requiring Anthropic to disable both models for all foreign nationals, citing awareness of a jailbreak method. Anthropic complied while publicly disputing the standard applied. It argued that requiring perfect jailbreak resistance would halt all frontier model deployments industry-wide, and noted that the demonstrated technique produces results already achievable with GPT-5.5. This is the first documented case of a US government export-control action forcing a frontier lab to suspend model access.

The governance conflict is not new: Anthropic had already publicly refused DoD demands (February 2026) to remove safeguards on mass domestic surveillance and fully autonomous weapons, and resisted a "supply chain risk" designation. The Fable 5 suspension marks the escalation from threat to enforcement.

---

Where the frontier is heading

The releases traced above point toward three converging pressures:

1. Capability: Models are crossing from benchmark performance into verifiable novel scientific contribution. The next generation of frontier releases will likely be evaluated partly on whether they produce publishable results, not just benchmark scores.

2. Agentic deployment: Claude Code, computer use, and autonomous cyber operations (including the first documented AI-orchestrated espionage campaign, using Claude Code as an autonomous agent) have made agentic deployment the primary product surface. Future releases will be judged on long-horizon task reliability, not single-turn quality.

3. Governance: The Fable 5 export-control action and the DoD safeguard dispute signal that frontier model releases are now geopolitical events. The question of who controls what a model can do — and for whom — is no longer abstract. It is being litigated in real time between labs, governments, and the public.

Figure: Frontier model release lineage by lab and phase

Selected frontier model milestones

| Model / System | Lab | Date | Key capability claim | Notable first |
|---|---|---|---|---|
| GPT-3 (175B) | OpenAI | 2020-05 | Few-shot learning across NLP tasks | Scaling law demonstration |
| ChatGPT | OpenAI | 2022-11 | Conversational dialogue at mass scale | Public AI adoption inflection |
| Mixtral 8x7B | Mistral AI | 2023-12 | SMoE: 46.7B params, 12.9B active; matches GPT-3.5 | Open-weight MoE at frontier quality |
| GPT-4o | OpenAI | 2024-05 | Native audio + vision + text in one model | Natively omnimodal flagship |
| o1 / o1-mini | OpenAI | 2024-09 | Chain-of-thought RL; PhD-level science benchmarks | Inference-time compute scaling axis |
| Llama 3.1 405B | Meta | 2024-07 | Frontier-class open-weight at 405B | Largest open-weight release at the time |
| Claude 3.7 Sonnet | Anthropic | 2025-09 | First hybrid reasoning model; SWE-bench SOTA | Hybrid instant + extended thinking |
| GPT-5 | OpenAI | 2025-08 | SOTA coding, math, vision; unified routing architecture | Multi-sub-model routing disclosed in system card |
| gpt-oss-120b / 20b | OpenAI | 2025-08 | Open-weight reasoning; Apache 2.0 | OpenAI's first significant open-weight release |
| Gemini 3 / 3.5 | Google DeepMind | 2025-11 / 2026-05 | New era intelligence; agentic action focus | — |
| Claude Opus 4 / Sonnet 4 | Anthropic | 2025-09 | 72.5% SWE-bench; hybrid thinking + parallel tools | Claude Code GA |
| Claude Opus 4.6 | Anthropic | 2026-03 | 1M-token context (beta); +144 Elo over GPT-5.2 on GDPval-AA | Adaptive effort levels |
| Claude Fable 5 / Mythos 5 | Anthropic | 2026-06 | SOTA across SW eng, science; per-domain safety tiers | First US export-control suspension of a frontier model |

Dates from published_at fields; unknown cells render —.

Timeline

  1. May 2020: GPT-3 paper (175B parameters, few-shot learning)
  2. Nov 2022: ChatGPT launches; public AI adoption inflection
  3. Dec 2023: Mixtral 8x7B, an open-weight MoE that matches GPT-3.5
  4. May 2024: GPT-4o, natively omnimodal (audio + vision + text)
  5. Sep 2024: o1, inference-time compute scaling via chain-of-thought RL
  6. Aug 2025: OpenAI releases gpt-oss-120b / 20b under Apache 2.0
  7. Aug 2025: GPT-5 released (unified routing, SOTA across domains)
  8. Sep 2025: Claude Opus 4 / Sonnet 4 (72.5% SWE-bench; Claude Code GA)
  9. Oct 2025: Gemini Deep Think reaches gold-medal standard at IMO 2025
  10. Nov 2025: Gemini 3 announced; Claude Sonnet 4.5 released (61.4% OSWorld)
  11. Dec 2025: GPT-5.2 released; later derives novel theoretical physics result
  12. Mar 2026: Claude Opus 4.6 (1M-token context, adaptive thinking)
  13. May 2026: Gemini 3.5, agentic action focus announced
  14. Jun 2026: Claude Fable 5 / Mythos 5, safety-tiered deployment; US export-control suspension follows

FAQ

What distinguishes a 'frontier' model release from an ordinary model update?

Frontier releases claim state-of-the-art results on at least one major capability axis — reasoning, coding, multimodality, context length — and typically introduce a new architectural or training technique rather than just incremental fine-tuning.

When did inference-time compute scaling become a distinct capability axis?

OpenAI's o1 release in September 2024 was the public inflection point, introducing chain-of-thought reasoning trained via reinforcement learning as a way to trade inference compute for capability gains independent of model size.

Have frontier models produced genuinely novel scientific results?

Yes. GPT-5.2 proposed a novel gluon amplitude formula in theoretical physics that was subsequently formally proved, and an OpenAI model disproved an 80-year-old conjecture in discrete geometry (the unit distance problem).

What is the significance of the Claude Fable 5 / Mythos 5 export-control action?

It is the first documented case of a US government export-control directive forcing a frontier lab to suspend model access for foreign nationals, triggered by a reported jailbreak — establishing a new regulatory precedent for government authority over commercial AI deployment.

Which labs have released open-weight frontier models?

Meta (Llama 3.1 up to 405B), Mistral AI (Mixtral 8x7B under Apache 2.0), and OpenAI (gpt-oss-120b and gpt-oss-20b under Apache 2.0) are the labs covered in this guide with open-weight frontier releases.

More on Frontier Model Releases (6)

Google DeepMind Blog · 1mo ago

AlphaEvolve: How our Gemini-powered coding agent is scaling impact across fields

DeepMind published a blog post detailing the real-world impact of AlphaEvolve, a Gemini-powered coding agent designed to discover and optimize algorithms. The post covers applications spanning business operations, infrastructure, and scientific research. AlphaEvolve represents a deployment of LLM-driven evolutionary algorithm search at scale across multiple domains.

AI Snake Oil · 1mo ago

Open-world evaluations for measuring frontier AI capabilities: Introducing CRUX

This commentary introduces CRUX, a new evaluation project designed to assess frontier AI systems on long-horizon, open-ended, and messy real-world tasks. The piece argues that existing benchmarks are insufficient for capturing the full range of capabilities exhibited by frontier models in complex settings. CRUX aims to fill this gap by providing evaluations that better reflect deployment-relevant performance.

One Useful Thing · 1mo ago

Sign of the Future: GPT-5.5 Commentary

A tier-2 commentary piece from One Useful Thing discusses GPT-5.5 as a notable step in the AI capability curve. The piece frames the release as a signal of future AI development trajectories. As a commentary source, it likely offers analysis of what GPT-5.5's capabilities imply rather than primary technical reporting.

Don't Worry About the Vase · 1mo ago

AI #168: Not Leading the Future

Zvi Mowshowitz's weekly AI roundup issue #168, characterized by the author as a 'lull' period in AI news. As a Tier 2 commentary source, this is a curated synthesis of recent AI/ML developments across the landscape. The brief body excerpt suggests a relatively quiet week in frontier AI activity.

Interconnects · 1mo ago

Latest open artifacts (#21): Open model bonanza — Gemma 4, DeepSeek V4, Kimi K2.6, MiMo 2.5, GLM-5.1 & others

Interconnects' recurring open-weights roundup covers a dense cluster of recent releases including Gemma 4, DeepSeek V4, Kimi K2.6, MiMo 2.5, and GLM-5.1, characterizing the period as a flagship-after-flagship cadence. The piece also includes commentary on CAISI's assessment of DeepSeek V4. As a tier-2 commentary source, this is a synthesis and analysis layer rather than primary announcements.

Import AI · 1mo ago

Import AI 456: RSI and Economic Growth, AI Regulation Optionality, and Neural Computer

Import AI issue 456 covers three topics: recursive self-improvement (RSI) and its implications for economic growth, frameworks for 'radical optionality' in AI regulation, and a neural computer architecture. The newsletter synthesizes recent developments in AI capability trajectories and governance approaches. As a tier-2 commentary source, it provides synthesis and analysis rather than primary research.