What this thread covers
Frontier model releases are the headline-grade checkpoints — new models, major version bumps, and architectural pivots — from the labs operating at the edge of what large-scale AI can do. This thread traces that lineage from the GPT-3 paper through the safety-tiered, government-contested deployments of mid-2026, covering OpenAI, Anthropic, Google DeepMind, Meta, Mistral AI, and MiniMax.
---
Phase 1 — The Scaling Thesis (2020–2022)
The modern frontier begins with a paper: OpenAI's GPT-3 (May 2020), a 175-billion-parameter autoregressive model that demonstrated strong few-shot performance across NLP tasks without gradient updates. The core claim — that scaling model size dramatically improves task-agnostic capability — became the organizing thesis of the field.
The thesis went mass-market in November 2022 with ChatGPT, a dialogue-tuned model fine-tuned from the GPT-3.5 series that could answer follow-up questions, acknowledge errors, and decline inappropriate requests. The dialogue format made the technology legible to non-specialists and triggered a wave of public and enterprise adoption that reshaped the competitive landscape.
---
Phase 2 — Multimodality and Open Weights (2023–mid-2024)
The next phase expanded the modality surface and opened the weights. Mistral AI's Mixtral 8x7B (December 2023) introduced a sparse mixture-of-experts architecture — 46.7B total parameters, 12.9B active per token — that matched or exceeded GPT-3.5 on most benchmarks under an Apache 2.0 license, demonstrating that open-weight models could reach near-frontier quality at a fraction of the inference cost.
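To make the total-versus-active distinction concrete, here is a minimal sketch of top-2 routing over 8 experts, the general pattern behind a sparse mixture-of-experts layer. It is a toy illustration, not Mistral's implementation: the function names, the scaling "experts", and the router logits are all hypothetical, and a real layer routes inside every transformer block with learned weights. Only the two parameter counts come from the text above.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
Expert = Callable[[Vector], Vector]

# Figures quoted in the text: 46.7B total parameters, 12.9B active per token.
ACTIVE_FRACTION = 12.9 / 46.7  # roughly 0.28 of the weights run for each token


def top_k_route(router_logits: Sequence[float], k: int = 2) -> List[Tuple[int, float]]:
    """Select the k highest-scoring experts and softmax-normalise their logits."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)[:k]
    peak = max(router_logits[i] for i in ranked)  # subtract the max for stability
    exps = [math.exp(router_logits[i] - peak) for i in ranked]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(ranked, exps)]


def moe_forward(token: Vector, router_logits: Sequence[float],
                experts: Sequence[Expert], k: int = 2) -> Vector:
    """Weighted sum of the outputs of the selected experts only."""
    out = [0.0] * len(token)
    for idx, weight in top_k_route(router_logits, k):
        expert_out = experts[idx](token)  # the other experts are never evaluated
        out = [o + weight * y for o, y in zip(out, expert_out)]
    return out


def make_scaler(factor: float) -> Expert:
    """Hypothetical stand-in for an expert feed-forward block."""
    return lambda x: [factor * v for v in x]


# Toy setup: expert i multiplies its input by (i + 1).
experts = [make_scaler(float(i + 1)) for i in range(8)]
logits = [0.1, 2.0, -1.0, 2.0, 0.0, -0.5, 0.3, -2.0]  # made-up router scores
routes = top_k_route(logits)
```

With these made-up logits, experts 1 and 3 are selected with equal weight, so only two of the eight expert functions run for the token. The active share is somewhat above 2/8 because attention and embedding weights are shared across all tokens regardless of routing.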
OpenAI's GPT-4o (May 2024) moved in the opposite direction: a natively omnimodal architecture processing audio, vision, and text in a unified model without separate pipeline stages, positioned as the primary production model going forward. Meta's Llama 3.1 (July 2024) pushed open weights to 405B parameters — the largest open-weight release to that point — with multilingual support and extended context.
---
Phase 3 — Inference-Time Scaling and the Reasoning Turn (late 2024)
OpenAI's o1 (September 2024) introduced a new capability axis: inference-time compute scaling via chain-of-thought reasoning trained with reinforcement learning. Rather than simply making models larger, o1 spent more compute at inference to reason through problems step by step. OpenAI reported that o1 ranked in the 89th percentile on competitive programming questions (Codeforces) and exceeded human PhD-level accuracy on GPQA, a benchmark of physics, biology, and chemistry problems. This reframed the scaling conversation: training-time compute was no longer the only lever.
Earlier that year, Anthropic's Claude 3 family (March 2024; Haiku, Sonnet, and Opus) had established the tiered-model pattern of multiple capability levels under a single brand, with Opus claiming top benchmark positions and a 200K context window with near-perfect recall.
---
Phase 4 — Agentic Coding and the SWE-Bench Era (mid–late 2025)
By mid-2025, the benchmark that mattered most for practitioner credibility was SWE-bench Verified, which measures the fraction of real GitHub issues a model can resolve autonomously. The climb had begun with the upgraded Claude 3.5 Sonnet (October 2024), which moved that number from 33.4% to 49.0%, surpassing all publicly available models at the time, including reasoning models. Computer use, in which Claude controls a desktop by viewing screens, moving cursors, and typing, launched in public beta simultaneously.
GPT-5 (August 2025) arrived with a unified routing architecture (gpt-5-main, gpt-5-thinking, and lightweight variants like gpt-5-thinking-nano) disclosed in its system card, claiming state-of-the-art across coding, mathematics, writing, and visual perception. The same week, OpenAI released its first open-weight language models since GPT-2: gpt-oss-120b and gpt-oss-20b under Apache 2.0, with the smaller model sized for consumer hardware.
Claude Opus 4 and Sonnet 4 (May 2025) pushed SWE-bench Verified to 72.5% and 72.7% respectively, with hybrid near-instant and extended thinking modes, parallel tool execution, and Claude Code going generally available with GitHub Actions and IDE integrations. Claude 3.7 Sonnet (February 2025) had already been positioned as the first hybrid reasoning model: a single model operating in both standard and extended thinking modes.
Claude Sonnet 4.5 (September 2025) extended the agentic lead with a 61.4% score on OSWorld (up from 42.2% for Sonnet 4), a native VS Code extension, and a Claude Agent SDK giving developers access to the same infrastructure that powers Claude Code.
Google DeepMind's Gemini 3 (November 2025) and Gemini 3.5 (May 2026) marked the Gemini line's pivot toward agentic capabilities and complex workflow execution.
---
Phase 5 — Novel Science and Mathematical Reasoning (late 2025–early 2026)
A qualitatively new capability claim emerged in this period: frontier models producing novel, verifiable scientific results rather than assisting with known work.
- Gemini with Deep Think achieved an externally validated gold-medal standard at IMO 2025 (July 2025) on a six-problem paper spanning algebra, combinatorics, geometry, and number theory, judged by the same standard as human competitors.
- GPT-5.2 (December 2025) proposed a novel formula for a gluon amplitude in theoretical physics, subsequently formally proved by OpenAI researchers and academic collaborators.
- An unnamed OpenAI model disproved an 80-year-old conjecture in discrete geometry — the unit distance problem — announced via the OpenAI blog (May 2026).
- MiniMax's MaxProof (June 2026) scored 35/42 on IMO 2025 and 36/42 on USAMO 2026 using population-level test-time scaling over a tournament of candidate proofs generated by their MiniMax-M3 model.
These results collectively shift the conversation from "AI assists researchers" to "AI produces research."
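The phrase "population-level test-time scaling over a tournament of candidate proofs" can be pictured with a small sketch. This is an assumption about the general shape of such a procedure, not MiniMax's method: the single-elimination bracket, the `tournament_select` helper, the pairwise `judge`, and the scored toy candidates are all hypothetical. In a real system the candidates would be model-generated proofs and the judge would be a verifier or grader model.

```python
from typing import Callable, List, Tuple, TypeVar

T = TypeVar("T")


def tournament_select(candidates: List[T], judge: Callable[[T, T], T]) -> T:
    """Single-elimination tournament: pair candidates off until one remains."""
    if not candidates:
        raise ValueError("need at least one candidate")
    pool = list(candidates)
    while len(pool) > 1:
        next_round: List[T] = []
        # Compare neighbours pairwise; the judge returns the preferred candidate.
        for i in range(0, len(pool) - 1, 2):
            next_round.append(judge(pool[i], pool[i + 1]))
        # An odd candidate out gets a bye into the next round.
        if len(pool) % 2 == 1:
            next_round.append(pool[-1])
        pool = next_round
    return pool[0]


# Toy population: (label, quality score). The scores are invented for the demo.
Candidate = Tuple[str, int]
population: List[Candidate] = [
    ("proof-a", 3), ("proof-b", 7), ("proof-c", 5), ("proof-d", 6), ("proof-e", 2),
]


def prefer_higher_score(left: Candidate, right: Candidate) -> Candidate:
    """Hypothetical judge: keep whichever candidate has the higher score."""
    return left if left[1] >= right[1] else right


winner = tournament_select(population, prefer_higher_score)
```

The design point is that extra inference compute is spent on generating and comparing a whole population of attempts rather than on a single longer attempt. A larger population raises the chance that at least one candidate is correct, provided the judge can reliably tell candidates apart.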
---
Phase 6 — Safety Tiers, Geopolitics, and the Governance Frontier (2026)
The most recent releases have introduced a new dimension: not just what models can do, but what they are permitted to do — and by whom.
Claude Opus 4.5 (November 2025) and Claude Opus 4.6 (February 2026) continued the capability march. Opus 4.6 shipped with a 1M-token context window in beta, adaptive thinking with developer-controlled effort levels, and a claimed 144-Elo lead over GPT-5.2 on GDPval-AA. Pricing held at $5/$25 per million input/output tokens.
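Two of the figures above are easier to read with a little arithmetic. The sketch below assumes that the Elo gap follows the standard logistic Elo formula (the exact GDPval-AA methodology may differ) and that "$5/$25" means dollars per million input and output tokens respectively. The function names and the example token counts are illustrative only.

```python
def elo_expected_score(rating_gap: float) -> float:
    """Expected head-to-head score for the higher-rated side under standard Elo."""
    return 1.0 / (1.0 + 10.0 ** (-rating_gap / 400.0))


def request_cost_usd(input_tokens: int, output_tokens: int,
                     input_price_per_m: float = 5.0,
                     output_price_per_m: float = 25.0) -> float:
    """Cost of one request given per-million-token prices (assumed input/output split)."""
    return (input_tokens / 1_000_000 * input_price_per_m
            + output_tokens / 1_000_000 * output_price_per_m)


# A 144-point gap corresponds to roughly a 70% expected score in pairwise comparisons.
lead_win_rate = elo_expected_score(144)

# Hypothetical request: 200K tokens of context in, 10K tokens out.
example_cost = request_cost_usd(200_000, 10_000)
```

Under these assumptions, a 144-Elo lead means the leading model would be preferred in about seven of ten pairwise comparisons, and the example request costs $1.25, most of it from the input side despite the lower per-token input price.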
Claude Fable 5 and Mythos 5 (June 2026) represent the most operationally complex frontier release to date. Mythos 5 is restricted-access, capable of cracking previously secure software. Fable 5 is the general-use version, featuring novel safety classifiers that block or degrade responses on cybersecurity, biology, chemistry, and AI-development topics — including, initially, undisclosed capability degradation applied silently via prompt modification or steering vectors, which sparked controversy before Anthropic modified the policy. Both models set new state-of-the-art results across software engineering, agentic coding, knowledge work, and scientific reasoning, at roughly half the cost of the prior Claude Mythos Preview.
Within 24 hours of the Fable 5 / Mythos 5 launch, the US government issued an export-control directive requiring Anthropic to disable both models for all foreign nationals, citing awareness of a jailbreak method. Anthropic complied while publicly disputing the standard applied — arguing that requiring perfect jailbreak resistance would halt all frontier model deployments industry-wide, noting that the demonstrated technique produces results already achievable by GPT-5.5. This is the first documented case of a US government export-control action forcing a frontier lab to suspend model access.
The governance conflict is not new: Anthropic had already publicly refused DoD demands (February 2026) to remove safeguards on mass domestic surveillance and fully autonomous weapons, and resisted a "supply chain risk" designation. The Fable 5 suspension marks the escalation from threat to enforcement.
---
Where the frontier is heading
The events in this thread point toward three converging pressures:
1. Capability: Models are crossing from benchmark performance into verifiable novel scientific contribution. The next generation of frontier releases will likely be evaluated partly on whether they produce publishable results, not just benchmark scores.
2. Agentic deployment: Claude Code, computer use, and autonomous cyber operations (including the first documented AI-orchestrated espionage campaign, using Claude Code as an autonomous agent) have made agentic deployment the primary product surface. Future releases will be judged on long-horizon task reliability, not single-turn quality.
3. Governance: The Fable 5 export-control action and the DoD safeguard dispute signal that frontier model releases are now geopolitical events. The question of who controls what a model can do — and for whom — is no longer abstract. It is being litigated in real time between labs, governments, and the public.




