Almanac
Topic guide · In-depth

Open Weights Progress: From Llama 2 to Frontier Parity

TL;DR: The open-weights ecosystem has undergone a structural transformation, from a handful of permissively licensed models lagging well behind closed frontier systems to a multi-lab competition in which open models now claim parity on reasoning, coding, and agentic benchmarks. The gap between open and closed has narrowed to the point where the dominant tensions have shifted: from capability to safety governance, from model quality to inference economics, and from individual releases to the infrastructure and tooling that make them deployable at scale.

Key takeaways

  • Mixtral 8x7B (Dec 2023) established sparse MoE as the dominant open-weights efficiency pattern, activating only 12.9B of 46.7B parameters per token while matching or exceeding GPT-3.5.
  • DeepSeek-R1 (MIT license, weights + outputs) claimed parity with OpenAI o1 on math, code, and reasoning, with API pricing at $0.55/$2.19 per million tokens — a fraction of comparable closed-model costs.
  • OpenAI entered the open-weights space in August 2025 with gpt-oss-120b and gpt-oss-20b under Apache 2.0, a significant strategic reversal for a historically closed lab.
  • DeepSeek-V4-Pro (1.6T total / 49B active parameters, 1M context by default) claims open-source SOTA on agentic coding, rivaling top closed models.
  • Qwen3-Coder-480B-A35B claims open-weight SOTA on agentic coding and browser-use, with performance described as comparable to Claude Sonnet 4.
  • GGML and llama.cpp joined Hugging Face in February 2026, consolidating the key local-inference stack under a single open-source umbrella.

What this area covers

Open-weights progress tracks the multi-year effort to make frontier-class language models publicly available — weights downloadable, deployable on private infrastructure, and fine-tunable without API intermediaries. The thread spans model releases from Meta, Mistral AI, DeepSeek, Alibaba's Qwen team, Google DeepMind, and — most recently — OpenAI, as well as the inference infrastructure, licensing regimes, and safety debates that shape how those models are actually used.

Why it matters

The practical stakes are high on multiple axes. For practitioners, open weights mean the ability to fine-tune on proprietary data, run inference behind a firewall, and avoid per-token API costs at scale. For the broader AI ecosystem, the gap between open and closed models is a proxy for how concentrated frontier capability is — and how quickly that concentration can be disrupted. For safety researchers and policymakers, open weights introduce irreversible proliferation: once weights are public, they cannot be recalled.

Phase 1: Establishing the baseline (2022–2023)

The modern open-weights era begins with BLOOM (176B, July 2022), a collaborative multilingual model from Hugging Face and the BigScience workshop — the first open model at GPT-3 scale. Meta's Llama 2 (July 2023) shifted the dynamic: a well-resourced frontier lab releasing competitive weights under a broadly permissive license, distributed through Hugging Face with Microsoft as a partner.

Mistral AI then demonstrated that a small team could punch above its weight. Mistral 7B (September 2023, Apache 2.0) outperformed Llama 2 13B across all evaluated benchmarks using Grouped-Query Attention and Sliding Window Attention for efficient inference. Three months later, Mixtral 8x7B (December 2023, Apache 2.0) introduced sparse Mixture-of-Experts to the open ecosystem: 46.7B total parameters, only 12.9B active per token, matching or exceeding GPT-3.5 at the inference cost of a 12.9B dense model. This architectural pattern — large total capacity, small active footprint — became the template for nearly every major open release that followed.
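The two Mixtral figures can be decomposed into shared and per-expert parameters, which helps explain why "8x7B" does not add up to 56B. The following is a minimal sketch, assuming 8 experts per MoE layer with 2 routed per token and assuming every non-expert parameter (attention, embeddings, norms) is shared; neither assumption is stated in this guide, and the function name is hypothetical.

```python
# Figures from the text, in billions of parameters.
TOTAL_B = 46.7
ACTIVE_B = 12.9

# Assumptions, not stated in the guide: 8 experts, 2 routed per token,
# and all non-expert parameters shared by every token.
NUM_EXPERTS = 8
EXPERTS_PER_TOKEN = 2


def split_parameters(total, active, num_experts, experts_per_token):
    """Solve total = shared + E * expert and active = shared + k * expert."""
    expert = (total - active) / (num_experts - experts_per_token)
    shared = active - experts_per_token * expert
    return shared, expert


shared_b, expert_b = split_parameters(
    TOTAL_B, ACTIVE_B, NUM_EXPERTS, EXPERTS_PER_TOKEN
)
print(f"shared: about {shared_b:.2f}B, per expert: about {expert_b:.2f}B")
```

Under these assumptions each expert holds roughly 5.6B parameters and about 1.6B are shared, so the "7B" in the name describes an expert plus the shared layers rather than eight independent 7B models.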

Phase 2: Scaling and multimodality (2024)

2024 was defined by scale races and capability expansion. Meta released Llama 3 (April 2024), then Llama 3.1 (July 2024) with a 405B flagship, multilingual support, and extended context — the first open model credibly positioned as frontier-class at release. Mistral followed with Mixtral 8x22B (April 2024, Apache 2.0, 141B total / 39B active, 64K context) and Mistral Large 2 (July 2024, 123B, 128K context, 80+ coding languages).

Alibaba's Qwen team emerged as a major force. Qwen2 (June 2024) introduced 128K context and strong multilingual coverage. Qwen2.5 (September 2024) was described as potentially the largest open-source model release in history by parameter count across the full family. Qwen2.5-Coder-32B (November 2024) claimed parity with GPT-4o on coding benchmarks — a significant milestone for a specialized open model.

Multimodality arrived in open weights: Llama 3.2 (September 2024) added vision-capable models alongside 1B/3B edge variants, and Qwen2.5-VL (January 2025) delivered vision-language models in three sizes, topping out at 72B.

Phase 3: Reasoning, agentic capability, and frontier parity (2025–2026)

The most consequential shift was DeepSeek-R1 (MIT license, weights and outputs freely usable for distillation). Claiming parity with OpenAI o1 on math, code, and reasoning benchmarks, with six distilled smaller variants and API pricing at $0.55/$2.19 per million tokens, R1 demonstrated that reasoning-class capability was no longer a closed-lab exclusive. It built on DeepSeek-V3 (671B MoE, 37B active, 14.8T training tokens, 60 tokens/second generation), released a few weeks earlier as a fully open-source frontier alternative with API pricing at $0.27/$1.10 per million tokens.
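To make the per-million-token prices concrete, here is a small cost estimate. It is a sketch only: treating the first figure in each pair as the input price and the second as the output price is an assumption, and the workload (10,000 requests of 2,000 input and 500 output tokens) is hypothetical.

```python
# USD per million tokens, taken from the text.
# Assumption: the pair is (input price, output price).
PRICES = {
    "DeepSeek-R1": (0.55, 2.19),
    "DeepSeek-V3": (0.27, 1.10),
}


def request_cost(model, input_tokens, output_tokens):
    """Cost in USD of one request at the listed per-million-token prices."""
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000


# Hypothetical workload: 10,000 requests, 2,000 input and 500 output tokens each.
for model in PRICES:
    total = 10_000 * request_cost(model, 2_000, 500)
    print(f"{model}: ${total:.2f}")
```

Because output tokens cost roughly four times as much as input tokens under this reading, reasoning models that emit long chains of thought can cost more per request than the headline input price suggests.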

Mistral expanded its open portfolio into reasoning with Magistral Small (24B, Apache 2.0, 70.7% on AIME2024), coding agents with Devstral 2 (123B, 72.2% SWE-bench Verified, 256K context), and speech with Voxtral (24B and 3B, Apache 2.0, outperforming Whisper large-v3). Qwen3 (April 2025) brought a 235B MoE flagship claiming competitive performance against DeepSeek-R1, OpenAI o1/o3-mini, Grok-3, and Gemini-2.5-Pro.

The most strategically significant event of 2025 was OpenAI's entry into open weights. In August 2025, OpenAI released gpt-oss-120b and gpt-oss-20b under Apache 2.0, a direct reversal of its historically closed posture that was likely driven by competitive pressure and was framed around accessibility and global reach. The release was accompanied by a safety evaluation methodology (malicious fine-tuning, or MFT) designed to assess worst-case risks before open-weight releases, signaling that safety governance for open models was becoming a first-class concern.

Into 2026, the frontier continued to move. Mistral Large 3 (675B MoE / 41B active, Apache 2.0) debuted at #2 on LMArena's OSS non-reasoning leaderboard. DeepSeek-V4-Pro (1.6T total / 49B active, 1M context by default via token-wise compression and DeepSeek Sparse Attention) claimed open-source SOTA on agentic coding. Qwen3-Coder-480B-A35B (480B MoE, 256K native context, 1M via extrapolation) claimed open-weight SOTA on agentic coding and browser-use, with performance described as comparable to Claude Sonnet 4. Mistral Medium 3.5 (128B dense, 77.6% SWE-Bench Verified, 256K context, runs on four GPUs) demonstrated that dense models could remain competitive with MoE at the 128B scale. Google DeepMind released Gemma 4, along with a Gemma 4 12B model built on a unified encoder-free multimodal architecture.

The infrastructure layer

Model releases are only half the story. The serving and fine-tuning stack that makes open weights usable has matured in parallel. vLLM, SGLang, llama.cpp, and Transformers are the dominant inference frameworks; NVIDIA NIM provides enterprise packaging with day-0 support for many releases. The most significant infrastructure event was GGML and llama.cpp joining Hugging Face in February 2026, consolidating the primary local-inference stack — which underpins most consumer and on-device deployments — under a single open-source organization with long-term sustainability backing.

Hugging Face itself functions as the de facto distribution layer: nearly every major open-weights release covered in this guide is available there, often on the same day as the lab announcement.

Safety and governance tensions

Open weights introduce irreversible proliferation risk that closed APIs do not. Two distinct concerns have crystallized:

Distillation attacks. Anthropic publicly identified DeepSeek, Moonshot AI, and MiniMax as conducting coordinated large-scale distillation campaigns against Claude — generating over 16 million exchanges through approximately 24,000 fraudulent accounts to train competing models. Anthropic framed this as a national security concern, arguing that illicitly distilled models strip out safety safeguards and undermine export controls.

Malicious fine-tuning. OpenAI introduced a methodology called malicious fine-tuning (MFT) to assess worst-case risks of open-weight releases, specifically probing for dangerous capability uplift in biology and cybersecurity domains. This represents an emerging norm: safety evaluation before open release, not just before API deployment.

Meta's Muse Spark (April 2026) — the first closed-weights model from Meta's Superintelligence Labs — signals that even the most committed open-weights lab is hedging: some capability tiers may remain proprietary regardless of the competitive environment.

Where it is heading

The open-weights frontier is no longer defined by the capability gap to closed models — on coding, math, and reasoning, that gap has largely closed. The active frontiers are:

  • Agentic and long-context capability: 1M-token context windows and tool-integrated chain-of-thought are now table stakes for flagship open releases.
  • Inference economics: MoE architectures, sparse attention (DeepSeek Sparse Attention, DSA), and aggressive API price cuts are driving the cost of frontier-class inference toward commodity levels.
  • Safety governance for open weights: MFT evaluations, distillation detection, and the question of which capability tiers should remain closed are unresolved and increasingly contested.
  • Infrastructure consolidation: llama.cpp and GGML joining Hugging Face suggests the ecosystem is moving toward a more unified, sustainably maintained serving stack rather than a fragmented collection of independent projects.

Open-weights ecosystem: labs, models, and infrastructure

Selected open-weights flagship models across the major labs

| Model | Params (total / active) | License | Key claim | Context |
|---|---|---|---|---|
| Llama 3.1 405B | 405B / 405B | Meta custom | Frontier-class open model at release | 128K |
| Mixtral 8x7B | 46.7B / 12.9B | Apache 2.0 | Matches/exceeds GPT-3.5; 2× inference speed | 32K |
| DeepSeek-R1 | | MIT | Parity with OpenAI o1 on math/code/reasoning | |
| DeepSeek-V4-Pro | 1.6T / 49B | Open-weights | Open-source SOTA agentic coding; 1M context default | 1M |
| Qwen3-235B-A22B | 235B / 22B | Open-weights | Competitive with DeepSeek-R1, o1, Gemini-2.5-Pro | |
| Qwen3-Coder-480B-A35B | 480B / 35B | Open-weights | Open-weight SOTA agentic coding; comparable to Claude Sonnet 4 | 256K (1M via extrapolation) |
| Mistral Large 3 | 675B / 41B | Apache 2.0 | #2 LMArena OSS non-reasoning leaderboard at release | |
| Mistral Medium 3.5 | 128B / 128B (dense) | Modified MIT | 77.6% SWE-Bench Verified; runs on 4 GPUs | 256K |
| gpt-oss-120b | 120B / — | Apache 2.0 | OpenAI's first open-weights reasoning model | |
| Gemma 4 12B | 12B / 12B | Open-weights | Encoder-free unified multimodal architecture | |

Cells marked — indicate that the source material does not disclose that value. Active params shown where MoE architecture is confirmed.
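One way to read the parameter column is as an active fraction: the share of a model's weights that a single token actually uses. The sketch below computes it for the MoE rows of the table. The dictionary and function names are illustrative, and dense models and rows with undisclosed values are left out.

```python
# (total, active) in billions of parameters, copied from the table above.
MOE_MODELS = {
    "Mixtral 8x7B": (46.7, 12.9),
    "DeepSeek-V4-Pro": (1600.0, 49.0),
    "Qwen3-235B-A22B": (235.0, 22.0),
    "Qwen3-Coder-480B-A35B": (480.0, 35.0),
    "Mistral Large 3": (675.0, 41.0),
}


def active_fraction(total_b, active_b):
    """Share of parameters used per token."""
    return active_b / total_b


fractions = {name: active_fraction(*sizes) for name, sizes in MOE_MODELS.items()}

# Print from sparsest to densest.
for name in sorted(fractions, key=fractions.get):
    print(f"{name:24s} {fractions[name]:6.1%}")
```

On these figures the newer flagships are markedly sparser than Mixtral 8x7B, activating under a tenth of their weights per token compared with more than a quarter for Mixtral.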

Timeline

  1. Llama 2 released — open-weights frontier access widens

  2. Mistral 7B ships under Apache 2.0, outperforms Llama 2 13B

  3. Mixtral 8x7B establishes sparse MoE as the efficiency template

  4. Llama 3.1 405B — first frontier-class open model from a top lab

  5. Llama 4 Maverick & Scout launch on Hugging Face

  6. OpenAI releases gpt-oss-120b/20b under Apache 2.0 — closed lab enters open-weights

  7. Mistral Large 3 (675B MoE) debuts at #2 on LMArena OSS leaderboard

  8. GGML and llama.cpp join Hugging Face — local inference stack consolidates

  9. Gemma 4 released — Google DeepMind's most capable open models

  10. Mistral Medium 3.5 (128B dense, 77.6% SWE-Bench) runs on 4 GPUs

FAQ

Are open-weights models now as capable as closed frontier models?

On several benchmarks — particularly coding, math, and reasoning — open-weights models like DeepSeek-R1, Qwen3-Coder-480B, and DeepSeek-V4-Pro now claim parity or near-parity with top closed systems. Gaps remain in some agentic and multimodal tasks, and closed labs continue to push the frontier.

What is the difference between 'open-weights' and 'open-source'?

Open-weights means the model parameters are publicly released for download and use; open-source additionally implies the training code, data, and full methodology are available. Most releases in this space are open-weights only, though licenses vary widely from Apache 2.0 (permissive commercial use) to custom research-only terms.

What is sparse MoE and why does it dominate large open models?

Sparse Mixture-of-Experts (MoE) routes each token through only a subset of the model's parameters — for example, Mixtral 8x7B activates 12.9B of 46.7B total parameters per token — so inference cost scales with active parameters, not total parameters. This lets labs release very large models that run at the cost of a much smaller dense model.
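The routing step can be sketched in a few lines. This is a toy illustration of one common formulation (pick the top-k experts by router score, then softmax over just those scores), not the implementation of any model named here; the scalar "experts", the random router logits, and all function names are assumptions made for the example.

```python
import math
import random


def top_k_route(router_logits, k=2):
    """Return [(expert_index, weight)] for the k highest-scoring experts."""
    order = sorted(
        range(len(router_logits)), key=lambda i: router_logits[i], reverse=True
    )
    top = order[:k]
    # Softmax over the selected logits only, so the k weights sum to 1.
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]
    norm = sum(exps)
    return [(i, e / norm) for i, e in zip(top, exps)]


def moe_layer(x, experts, router_logits, k=2):
    """Weighted sum of the outputs of only the selected experts."""
    return sum(w * experts[i](x) for i, w in top_k_route(router_logits, k))


# Toy stand-ins: each "expert" scales its input; real experts are feed-forward blocks.
experts = [lambda x, s=s: s * x for s in range(1, 9)]
random.seed(0)
logits = [random.gauss(0, 1) for _ in experts]  # a real router derives these from the token
chosen = top_k_route(logits)
print("experts used:", [i for i, _ in chosen], "of", len(experts))
print("layer output:", moe_layer(1.0, experts, logits))
```

Only the selected experts are ever called, which is where the compute saving comes from; all experts still have to be held in memory, so total parameter count continues to drive hardware requirements.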

What is the distillation risk that Anthropic raised?

Distillation attacks involve generating large volumes of model outputs from a proprietary system and using them to train a separate model, effectively transferring capability without authorization. Anthropic identified DeepSeek, Moonshot AI, and MiniMax as conducting such campaigns against Claude at scale (over 16 million exchanges), and argued the resulting models strip out safety safeguards.

Why did OpenAI release open-weights models in 2025?

OpenAI released gpt-oss-120b and gpt-oss-20b under Apache 2.0 in August 2025, framing the move as a step toward broader AI accessibility. The release marked a significant strategic shift for a lab that had historically kept all frontier models proprietary, likely driven by competitive pressure from Meta, Mistral, DeepSeek, and Qwen.

What infrastructure underpins local open-weights deployment?

The primary local inference stack is llama.cpp (C++ runtime with quantization) and GGML (the underlying tensor library), both of which joined Hugging Face in February 2026. On the serving side, vLLM, SGLang, and Transformers are the dominant frameworks, with NVIDIA NIM providing day-0 enterprise packaging for many releases.
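Quantization matters for local deployment because weight memory scales with bits per weight. The following back-of-the-envelope sketch uses parameter counts quoted in this guide; the bit widths are illustrative assumptions rather than the precisions any lab or runtime ships, and the estimate ignores KV cache, activations, and quantization metadata.

```python
def weight_memory_gb(params_billion, bits_per_weight):
    """Memory for the weights alone, in GB (10^9 bytes)."""
    return params_billion * bits_per_weight / 8


# Parameter counts (billions) come from the guide. For MoE models the total
# count is used, since every expert must be resident in memory.
MODELS = [
    ("Mixtral 8x7B", 46.7),
    ("Mistral Medium 3.5", 128),
    ("Llama 3.1 405B", 405),
]

for name, params_b in MODELS:
    sizes = ", ".join(
        f"{bits}-bit: {weight_memory_gb(params_b, bits):.0f} GB"
        for bits in (16, 8, 4)  # illustrative bit widths
    )
    print(f"{name}: {sizes}")
```

Real quantized files run somewhat larger than this estimate, since formats typically store per-block scales alongside the weights.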

More on Open Weights Progress (6)

Hugging Face Blog · 1mo ago

Granite Embedding Multilingual R2: Open Apache 2.0 Multilingual Embeddings with 32K Context

IBM released Granite Embedding Multilingual R2, an open-weights (Apache 2.0) multilingual embedding model with 32K context window, claiming best-in-class retrieval quality among sub-100M parameter models. The model is positioned for enterprise RAG and retrieval use cases across multiple languages. It is hosted and announced via Hugging Face.

Interconnects · 1mo ago

Latest open artifacts (#21): Open model bonanza — Gemma 4, DeepSeek V4, Kimi K2.6, MiMo 2.5, GLM-5.1 & others

Interconnects' recurring open-weights roundup covers a dense cluster of recent releases including Gemma 4, DeepSeek V4, Kimi K2.6, MiMo 2.5, and GLM-5.1, characterizing the period as a flagship-after-flagship cadence. The piece also includes commentary on CAISI's assessment of DeepSeek V4. As a tier-2 commentary source, this is a synthesis and analysis layer rather than primary announcements.

Interconnects · 1mo ago

How Open Model Ecosystems Compound

This Interconnects commentary examines how China's open-first, high-participation AI ecosystem creates compounding advantages over time. The piece reflects on the structural dynamics of open model ecosystems and their strategic implications. It appears to analyze how broad community participation in open-weight model development accelerates capability progress.

Interconnects · 1mo ago

Notes from inside China's AI labs

A firsthand account from visits to leading AI labs in China, offering observations on their research culture, capabilities, and strategic direction. The piece provides rare insider perspective on the state of Chinese frontier AI development. Published on Interconnects, a tier-2 commentary source focused on the AI/ML landscape.

Hugging Face Blog · 1mo ago

EMO: Pretraining Mixture of Experts for Emergent Modularity

AllenAI introduces EMO, a pretraining approach for Mixture of Experts (MoE) models that aims to produce emergent modularity during training. The work explores how MoE architectures can develop specialized expert routing without explicit supervision. Published on the Hugging Face blog, this represents research-level work on improving MoE training dynamics and efficiency.

Interconnects · 1mo ago

The Distillation Panic

A commentary piece from Interconnects critiques the framing of 'distillation attacks' as a term for the current trend of training models on outputs from frontier systems. The author appears to argue the terminology is misleading or alarmist. The piece engages with ongoing industry debate about knowledge distillation, model output licensing, and competitive dynamics between AI labs.