Almanac
Topic guide · Beginner

Frontier Model Releases: The Race From Language to Action

TL;DR

What began as a contest over which AI could answer questions most fluently has become a race to build systems that can take actions in the world: writing and running code, controlling computers, and solving problems that once required human experts. The pace has accelerated sharply, with major labs shipping landmark models every few months and the commercial, scientific, and geopolitical stakes rising alongside the capabilities.

Key takeaways

  • GPT-3's 175-billion-parameter paper (May 2020) and ChatGPT's public launch (November 2022) established the modern era of large language models.
  • OpenAI's o1 (September 2024) introduced inference-time 'thinking' as a new axis of capability improvement, separate from training-time scaling.
  • Claude Opus 4 / Sonnet 4 (May 2025) and GPT-5 (August 2025) marked the arrival of models claiming state-of-the-art results on software engineering benchmarks such as SWE-bench.
  • AI systems have begun producing novel scientific results: GPT-5.2 derived a new formula in theoretical physics, an OpenAI model disproved an 80-year-old geometry conjecture, and Gemini with Deep Think achieved gold-medal standard at IMO 2025.
  • Claude Mythos 5 and Fable 5 (June 2026) introduced safety-tiered deployment — including silent capability restrictions — triggering a U.S. government export control order within days of launch.
  • OpenAI released open-weight models (gpt-oss-120b and gpt-oss-20b, Apache 2.0) in August 2025, a strategic shift for a lab that had kept all frontier models proprietary.

What this area covers

Frontier model releases are the headline events of the modern AI era: the moments when a major lab ships a new AI system that meaningfully expands what machines can do. This thread tracks those releases — and the scientific, commercial, and regulatory shockwaves that follow — from the foundational GPT-3 paper through the safety-tiered deployments of mid-2026.

Why it matters

If you use any AI tool today — a chatbot, a coding assistant, an image generator — it almost certainly traces back to one of the releases in this thread. These models set the ceiling for what AI can do, and that ceiling has risen faster than almost anyone predicted. They also increasingly set the terms of geopolitical competition, regulatory debate, and scientific discovery.

Phase 1: The language era (2020–2023)

The modern story starts with GPT-3 in May 2020 — a 175-billion-parameter model that showed, for the first time at scale, that a single large AI could handle a huge variety of tasks without being specifically trained for each one. You could give it a few examples of what you wanted (called "few-shot learning") and it would figure out the pattern.
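To make few-shot learning concrete, here is a minimal sketch of how such a prompt can be assembled. The `Input:`/`Output:` layout, the function name `build_few_shot_prompt`, and the translation examples are illustrative assumptions, not the format used in the GPT-3 paper or by any particular API.

```python
def build_few_shot_prompt(examples, query, instruction=""):
    """Format (input, output) example pairs plus a new query into one prompt string."""
    lines = []
    if instruction:
        lines.append(instruction)
        lines.append("")
    for source, target in examples:
        lines.append(f"Input: {source}")
        lines.append(f"Output: {target}")
        lines.append("")
    # The final entry leaves the output blank for the model to complete.
    lines.append(f"Input: {query}")
    lines.append("Output:")
    return "\n".join(lines)


# Hypothetical examples: three solved cases, then one unsolved query.
examples = [("cheese", "fromage"), ("house", "maison"), ("cat", "chat")]
prompt = build_few_shot_prompt(examples, "dog", "Translate English to French.")
print(prompt)
```

The model is never retrained here. The pattern lives entirely in the text of the prompt, and the model is expected to continue it.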

That research insight became a product moment in November 2022, when OpenAI launched ChatGPT. The conversational format — ask a question, get an answer, follow up, push back — made AI feel accessible to everyone, not just researchers. Public adoption was unlike anything the tech industry had seen.

Mistral AI entered the picture in December 2023 with Mixtral 8x7B, an open-weight model (free to download and modify) that matched or beat GPT-3.5 on many tasks while running far more efficiently. It signaled that the frontier wasn't exclusively the domain of well-funded American labs.

Phase 2: Seeing, thinking, and acting (2024)

Three releases defined 2024. First, GPT-4o (May 2024) made AI natively multimodal — it could process text, audio, and images in a single unified system, rather than routing through separate pipelines. Second, Meta's Llama 3.1 (July 2024) pushed open-weight models to 405 billion parameters, putting frontier-class capability in the hands of anyone with the hardware to run it. Third — and perhaps most consequential for the long run — OpenAI's o1 (September 2024) introduced a new idea: instead of just making models bigger, you could make them think longer at inference time, using chain-of-thought reasoning to work through hard problems step by step. This "inference-time scaling" opened a new axis of improvement that didn't require ever-larger training runs.
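The idea that extra compute at inference time can buy accuracy can be illustrated with a toy simulation. This sketch does not reproduce how o1 works. It uses a simpler and different technique, majority voting over repeated samples, and the "solver" is a hypothetical random stand-in that is right 60% of the time. All names and numbers are assumptions chosen for illustration.

```python
import random
from collections import Counter


def noisy_solver(rng, correct_answer=42, p_correct=0.6):
    """Hypothetical stand-in for one sampled answer: right with probability p_correct."""
    if rng.random() < p_correct:
        return correct_answer
    # Wrong answers are scattered across several values, so they rarely agree.
    return rng.choice([40, 41, 43, 44])


def majority_vote(rng, n_samples):
    """Sample n_samples answers and return the most common one."""
    answers = [noisy_solver(rng) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]


def accuracy(n_samples, trials=2000, seed=0):
    """Fraction of trials in which the voted answer equals the correct one (42)."""
    rng = random.Random(seed)
    hits = sum(majority_vote(rng, n_samples) == 42 for _ in range(trials))
    return hits / trials


# More samples per question means more inference-time compute.
for n in (1, 5, 25):
    print(n, accuracy(n))
```

In this toy setup the single-sample accuracy sits near the solver's 60% base rate, while voting over many samples pushes it much higher, without changing the underlying "model" at all.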

Phase 3: Agents take over (2025)

By 2025, the question had shifted from "can AI answer questions?" to "can AI do things?" The answer, increasingly, was yes.

Anthropic's upgraded Claude 3.5 Sonnet had already introduced computer use in October 2024: the ability for an AI to look at a screen, move a cursor, click, and type, much as a person at a keyboard would. OpenAI shipped GPT-5 (August 2025), claiming state-of-the-art performance across coding, math, writing, and vision. In a notable strategic shift, OpenAI also released two open-weight models, gpt-oss-120b and gpt-oss-20b, under the permissive Apache 2.0 license. They were the company's first open-weight language models since GPT-2 in 2019.

Claude Opus 4 and Sonnet 4 (May 2025) raised the bar on software engineering benchmarks, with Opus 4 scoring 72.5% on SWE-bench Verified, a test built from real-world coding tasks. Claude Code, Anthropic's autonomous coding tool, became generally available with integrations into GitHub, VS Code, and JetBrains.

Then came a string of results that would have seemed like science fiction a few years earlier. Gemini with Deep Think achieved gold-medal standard at the International Mathematical Olympiad (July 2025). GPT-5.2 derived a genuinely new formula in theoretical physics, subsequently verified by researchers (February 2026). An OpenAI model disproved an 80-year-old conjecture in discrete geometry (May 2026). MiniMax's MaxProof system scored above the human gold-medal threshold on both IMO 2025 and USAMO 2026 (June 2026). AI was no longer just assisting with known science; it was producing new results of its own.

Phase 4: Safety tiers and government friction (2026)

The most recent releases have introduced a new layer of complexity: not just what a model can do, but who gets access to which capabilities under what conditions.

Anthropic's Claude Mythos 5 and Fable 5 (June 2026) represent the clearest example. Mythos 5 is a restricted-access model capable of cracking previously secure software. Fable 5 is the general-use version, but it ships with novel safety classifiers that silently degrade responses on sensitive topics — cybersecurity, biology, chemistry, AI development — via prompt modification or steering vectors. The undisclosed nature of this degradation sparked controversy before Anthropic revised the policy. Within a day of launch, the U.S. government issued an export control directive requiring Anthropic to suspend both models for all foreign nationals, citing awareness of a jailbreak technique. Anthropic complied while publicly disputing the standard applied, arguing that requiring perfect jailbreak resistance would effectively halt all frontier model deployments.

This followed months of tension between Anthropic and the U.S. Department of War over whether Claude could be used for mass domestic surveillance and fully autonomous weapons — uses Anthropic refused to enable regardless of government pressure.

Where it's heading

The trajectory points toward AI systems that are less like tools you query and more like colleagues you assign work to — capable of running for hours, using software, and producing results that require expert verification rather than expert creation. The binding questions are no longer purely technical: they are about who controls access, what uses are permitted, and how governments and companies negotiate those boundaries in real time.

The pace of releases shows no sign of slowing. The gap between a model's technical capability and society's ability to govern it is, if anything, widening.

The arc of frontier model releases: from language to action

Timeline

  1. GPT-3 paper: 175B parameters, few-shot learning

  2. ChatGPT launches — AI goes mainstream

  3. Mixtral 8x7B: open-weight sparse mixture-of-experts from Mistral AI

  4. GPT-4o: natively multimodal (text, audio, vision)

  5. OpenAI o1: inference-time 'thinking' as a new scaling axis

  6. Claude 3.5 Sonnet gains computer use — AI controls a desktop

  7. Gemini with Deep Think achieves IMO gold-medal standard

  8. GPT-5 released; OpenAI also ships open-weight gpt-oss models

  9. Gemini 3 announced; Claude Sonnet 4.5 tops OSWorld at 61.4%

  10. GPT-5.2 derives a new result in theoretical physics

  11. An OpenAI model disproves an 80-year-old geometry conjecture

  12. Claude Mythos 5 & Fable 5: safety-tiered deployment, export control order follows

FAQ

What is a 'frontier model'?

A frontier model is one of the most capable AI systems available at a given moment, at the cutting edge of what the technology can do. Labs like OpenAI, Anthropic, and Google DeepMind compete to hold this position.

Why do new models keep coming out so fast?

Labs are racing on multiple fronts at once: raw capability, cost, speed, and safety. Each new model typically improves on at least one of these, and competition between labs accelerates the pace.

What does 'agentic AI' mean?

An agentic AI can take sequences of actions — browsing the web, writing and running code, controlling a computer — rather than just answering a single question. Most recent frontier releases have emphasized this capability.
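The loop behind that description can be sketched in a few lines. This is a toy under stated assumptions, not any lab's implementation: the "model" is a hard-coded `scripted_policy`, and the tools are two hypothetical arithmetic functions standing in for a browser, shell, or code runner.

```python
from typing import Callable, Dict, List, Tuple


# Hypothetical tools: a real agent would wrap a browser, shell, or code runner.
def add(a: str, b: str) -> str:
    return str(int(a) + int(b))


def multiply(a: str, b: str) -> str:
    return str(int(a) * int(b))


TOOLS: Dict[str, Callable[[str, str], str]] = {"add": add, "multiply": multiply}


def scripted_policy(history: List[str]) -> Tuple[str, List[str]]:
    """Stand-in for the model: picks the next action from the observations so far."""
    if len(history) == 0:
        return "add", ["2", "3"]
    if len(history) == 1:
        return "multiply", [history[0], "4"]
    return "finish", [history[-1]]


def run_agent(policy, max_steps: int = 10) -> str:
    """Repeat: ask the policy for an action, run the tool, record the observation."""
    history: List[str] = []
    for _ in range(max_steps):
        action, args = policy(history)
        if action == "finish":
            return args[0]
        observation = TOOLS[action](*args)
        history.append(observation)
    raise RuntimeError("agent did not finish within max_steps")


result = run_agent(scripted_policy)
print(result)
```

The key difference from a single question-and-answer exchange is that each tool result is fed back in before the next decision, and a step limit keeps the loop from running forever.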

Are any frontier models free to use or modify?

Some are. OpenAI's gpt-oss-120b and gpt-oss-20b are open-weight models under the permissive Apache 2.0 license, and Meta's Llama 3.1 is open-weight under Meta's own community license, which adds some usage conditions. Anyone with suitable hardware can download and run them. Most frontier models from OpenAI and Anthropic remain proprietary.

What is the government's role in frontier model releases?

Governments are increasingly involved: the U.S. issued an export control order requiring Anthropic to suspend access to its Mythos 5 and Fable 5 models for foreign nationals, and Anthropic has publicly clashed with the Department of War over what uses of Claude the government can demand.

More on Frontier Model Releases (6)

Google DeepMind Blog

AlphaEvolve: How our Gemini-powered coding agent is scaling impact across fields

DeepMind published a blog post detailing the real-world impact of AlphaEvolve, a Gemini-powered coding agent designed to discover and optimize algorithms. The post covers applications spanning business operations, infrastructure, and scientific research. AlphaEvolve represents a deployment of LLM-driven evolutionary algorithm search at scale across multiple domains.

AI Snake Oil

Open-world evaluations for measuring frontier AI capabilities: Introducing CRUX

This commentary introduces CRUX, a new evaluation project designed to assess frontier AI systems on long-horizon, open-ended, and messy real-world tasks. The piece argues that existing benchmarks are insufficient for capturing the full range of capabilities exhibited by frontier models in complex settings. CRUX aims to fill this gap by providing evaluations that better reflect deployment-relevant performance.

One Useful Thing

Sign of the Future: GPT-5.5 Commentary

A tier-2 commentary piece from One Useful Thing discusses GPT-5.5 as a notable step in the AI capability curve. The piece frames the release as a signal of future AI development trajectories. As a commentary source, it likely offers analysis of what GPT-5.5's capabilities imply rather than primary technical reporting.

Don't Worry About the Vase

AI #168: Not Leading the Future

Zvi Mowshowitz's weekly AI roundup issue #168, characterized by the author as a 'lull' period in AI news. As a Tier 2 commentary source, this is a curated synthesis of recent AI/ML developments across the landscape. The brief body excerpt suggests a relatively quiet week in frontier AI activity.

Interconnects

Latest open artifacts (#21): Open model bonanza — Gemma 4, DeepSeek V4, Kimi K2.6, MiMo 2.5, GLM-5.1 & others

Interconnects' recurring open-weights roundup covers a dense cluster of recent releases including Gemma 4, DeepSeek V4, Kimi K2.6, MiMo 2.5, and GLM-5.1, characterizing the period as a flagship-after-flagship cadence. The piece also includes commentary on CAISI's assessment of DeepSeek V4. As a tier-2 commentary source, this is a synthesis and analysis layer rather than primary announcements.

Import AI

Import AI 456: RSI and Economic Growth, AI Regulation Optionality, and Neural Computer

Import AI issue 456 covers three topics: recursive self-improvement (RSI) and its implications for economic growth, frameworks for 'radical optionality' in AI regulation, and a neural computer architecture. The newsletter synthesizes recent developments in AI capability trajectories and governance approaches. As a tier-2 commentary source, it provides synthesis and analysis rather than primary research.