Concept guide · In-depth

Mixture of Experts: Sparse Routing for Scalable, Efficient AI

TL;DR: Mixture of Experts (MoE) is the architectural pattern that lets AI models grow very large in total parameter count while keeping per-token compute fixed, because only a small subset of "expert" sub-networks fires for any given input. What began as a research curiosity became the dominant design for frontier open-weight models, and the field is now pushing MoE in every direction: down to mobile devices, out to multimodal and agentic workloads, and deeper into training efficiency and dynamic sparsity.

Key takeaways

DeepSeek-V3 (671B total / 37B active) runs at 60 tokens/second and is priced at $0.27/$1.10 per million tokens — demonstrating that MoE makes frontier-scale models economically viable.
Mistral Small 4 (119B total / 6B active) claims 40% latency reduction and 3× throughput over its dense predecessor, unifying reasoning, multimodal, and coding in one open-weights Apache 2.0 model.
MobileMoE shows MoE scaling laws extend to sub-billion active-parameter on-device models, delivering 2–4× fewer inference FLOPs and 1.8–3.8× faster prefill than dense baselines on smartphones.
ZEDA demonstrates post-hoc dynamic sparsification: self-distillation can eliminate over 50% of expert FLOPs from a static trained MoE with only marginal accuracy loss.
Complete-muE solves the hyperparameter transfer problem across dense-to-MoE architecture changes, enabling a 'tune dense once, transfer to all' recipe without costly re-tuning.
Qwen's global-batch load balancing addresses expert imbalance — a core MoE training bottleneck — and is described as a near 'free lunch' improvement.

What it is

Mixture of Experts (MoE) is a neural network architectural pattern in which a model's feed-forward layers are replaced by a collection of parallel "expert" sub-networks, plus a learned router that selects only a small subset of those experts to process each token. The key invariant: total parameter count and per-token compute are decoupled. A 671B-parameter model like DeepSeek-V3 activates only 37B parameters per token — roughly the compute budget of a mid-size dense model — while retaining the representational capacity of a much larger one.

How it works

In a standard transformer, each layer has one feed-forward network (FFN) that every token passes through. In a sparse MoE layer, that single FFN is replaced by N expert FFNs and a router network. The router scores each token against all experts and selects the top-k (typically 2) by score; only those experts compute an output, which is then weighted and summed. The rest of the experts do no work for that token.
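The routing step described above can be sketched in a few lines. This is a minimal illustration in plain Python, not any particular model's implementation: the helper names (`route_top_k`, `moe_layer`) and the toy experts are hypothetical, and renormalising the gate weights with a softmax over only the selected experts is one common convention among several.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_logits, k=2):
    """Return [(expert_index, gate_weight), ...] for one token."""
    # Pick the k highest-scoring experts (ties keep the lower index first).
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    top = order[:k]
    # Assumption: gate weights are a softmax over the selected scores only.
    gates = softmax([router_logits[i] for i in top])
    return list(zip(top, gates))

def moe_layer(token, router_logits, experts, k=2):
    """Weighted sum of the outputs of the k selected experts."""
    out = [0.0] * len(token)
    for idx, gate in route_top_k(router_logits, k):
        y = experts[idx](token)  # only selected experts are evaluated
        out = [o + gate * v for o, v in zip(out, y)]
    return out

# Toy experts: expert i multiplies its input by (i + 1).
experts = [lambda x, s=i + 1: [s * v for v in x] for i in range(4)]
logits = [0.1, 2.0, -1.0, 2.0]  # router scores for one token

selected = route_top_k(logits, k=2)
output = moe_layer([1.0, -1.0], logits, experts, k=2)
print(selected, output)
```

Here experts 1 and 3 tie for the highest score, so each receives a gate weight of 0.5 and experts 0 and 2 are never called for this token.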

This creates two distinct parameter counts that practitioners must track:

Total parameters: the full weight of all experts combined — determines memory footprint and storage.
Active parameters: the weights actually used per token — determines inference FLOPs and latency.
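To make the two counts concrete, the sketch below tallies them for a single MoE feed-forward layer. The dimensions are made-up illustration values, not the configuration of any model named in this guide, and the count assumes a plain two-matrix FFN without biases; attention, embeddings, and gated FFN variants are deliberately left out.

```python
def ffn_params(d_model, d_ff):
    # Up-projection plus down-projection weight matrices, biases ignored.
    return 2 * d_model * d_ff

def moe_ffn_param_counts(d_model, d_ff, n_experts, top_k):
    """Return (total, active) parameter counts for one MoE FFN layer."""
    per_expert = ffn_params(d_model, d_ff)
    router = d_model * n_experts              # one linear scoring layer
    total = n_experts * per_expert + router   # must all sit in memory
    active = top_k * per_expert + router      # touched for each token
    return total, active

# Hypothetical layer: 8 experts, 2 active per token.
total, active = moe_ffn_param_counts(d_model=1024, d_ff=4096,
                                     n_experts=8, top_k=2)
print(total, active, round(active / total, 3))
```

With 2 of 8 experts active, roughly a quarter of the layer's weights are used per token, while all of them still have to be stored.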

Mixtral's original open-weight release used 8 experts with 2 active per token. DeepSeek-V3 scales this to a much finer granularity: each MoE layer has 256 routed experts plus one shared expert, with 8 routed experts active per token. Mistral Small 4 activates only 6B of its 119B total parameters per token.

Why it matters

MoE is the reason frontier-scale open-weight models are economically deployable. DeepSeek-V3 runs at 60 tokens/second — three times faster than its predecessor — and is priced at $0.27/$1.10 per million input/output tokens, a fraction of comparable dense-model pricing. Mistral Small 4 reports a 40% latency reduction and 3× throughput improvement over Mistral Small 3 while unifying reasoning, multimodal, and coding capabilities that previously required separate models. The pattern has become the default for any lab that wants to push capability without proportionally scaling inference cost.

The open-weight MoE landscape

Recent open-weight MoE releases cluster densely across the capability spectrum:

Frontier coding/agentic: Qwen3-Coder (480B/35B active, 256K context, claims parity with Claude Sonnet 4 on agentic coding); GLM-5.1 (754B/40B active, MIT license, designed for 8-hour agentic coding sessions with thousands of tool calls).
Unified multimodal: Mistral Small 4 (119B/6B active, Apache 2.0, native text+image, configurable reasoning effort); TML-Interaction-Small (276B, audio/video/text, 200ms micro-turns for real-time interaction).
Efficient small-scale: Qwen1.5-MoE-A2.7B matches 7B dense models at one-third the activated parameters. MobileMoE pushes MoE scaling laws to 0.3–0.9B active parameters, achieving 2–4× fewer inference FLOPs and 1.8–3.8× faster prefill than dense baselines on commodity smartphones.

Training challenges and mitigations

MoE introduces training difficulties absent from dense models:

Expert load imbalance is the most persistent: routers collapse onto a few popular experts, wasting capacity and degrading quality. Qwen's global-batch load balancing addresses this at the router level and is described as a near "free lunch" improvement. AllenAI's EMO pretraining approach explores whether emergent modularity — spontaneous expert specialization — can be induced without explicit supervision.
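As a rough illustration of why the scope of the balancing statistics matters, the sketch below uses the widely used Switch Transformer-style auxiliary loss (number of experts times the sum over experts of dispatch fraction times mean router probability) and evaluates it per micro-batch and over the combined batch. It is not Qwen's implementation: top-1 routing, the two-expert setup, and the toy data are assumptions chosen to keep the arithmetic visible.

```python
def load_balance_loss(assignments, probs, n_experts):
    """Switch-style auxiliary loss: n_experts * sum_e f_e * p_e.

    assignments: chosen expert index per token (top-1 for simplicity)
    probs: per-token router probability vectors
    """
    n_tokens = len(assignments)
    # f_e: fraction of tokens dispatched to expert e
    f = [assignments.count(e) / n_tokens for e in range(n_experts)]
    # p_e: mean router probability assigned to expert e
    p = [sum(row[e] for row in probs) / n_tokens for e in range(n_experts)]
    return n_experts * sum(fe * pe for fe, pe in zip(f, p))

# Two micro-batches from different domains, each favouring one expert.
micro_a = ([0, 0], [[0.9, 0.1], [0.9, 0.1]])
micro_b = ([1, 1], [[0.1, 0.9], [0.1, 0.9]])

# Statistics computed separately inside each micro-batch.
per_micro = [load_balance_loss(a, p, 2) for a, p in (micro_a, micro_b)]
micro_mean = sum(per_micro) / len(per_micro)

# Statistics computed once over the combined (global) batch.
global_loss = load_balance_loss(micro_a[0] + micro_b[0],
                                micro_a[1] + micro_b[1], 2)
print(micro_mean, global_loss)
```

The per-micro-batch loss is about 1.8, penalising each domain for concentrating on one expert, while the global-batch loss sits at its balanced minimum of 1.0 because the experts are evenly loaded overall. In this toy setting, the wider scope leaves room for experts to specialise by domain.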

Hyperparameter transfer across architecture changes (dense → MoE, or changing tokens-per-expert ratios) has historically required expensive re-tuning. Complete-muE solves this with a two-bridge system: Bridge I maps dense FFN to Dense MoE via active-width μP with normalized router scale; Bridge II maps Dense MoE to sparse MoE via activated-expert scaling with a first-order SDE correction. The practical result is a "tune dense once, transfer to all" recipe.

Agent-native training frameworks: PithTrain is a new MoE training system designed so AI coding agents can efficiently understand and extend it, introducing the agent-task efficiency (ATE) metric. It matches production-framework throughput while reducing agent interaction overhead by up to 62%.

Post-training efficiency: dynamic sparsification

A significant recent direction is making static trained MoE models more sparse after the fact. ZEDA (Zero-Expert Self-Distillation Adaptation) injects parameter-free zero-output experts into each MoE layer and uses two-stage self-distillation with the original model as a frozen teacher. Applied to Qwen3-30B-A3B and GLM-4.7-Flash, it eliminates over 50% of expert FLOPs with marginal accuracy loss and achieves approximately 1.20× end-to-end inference speedup — outperforming the strongest dynamic MoE baseline by 4–6 points across 11 benchmarks.
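The effect of zero-output experts on compute can be illustrated with a toy router. This sketch only shows the accounting idea, assuming that a top-k slot landing on a zero expert costs no expert FLOPs; it does not reproduce ZEDA's router design or its self-distillation stages, and the function name and logits are hypothetical.

```python
def route_with_zero_experts(logits, n_real, k):
    """Top-k routing over real experts followed by zero-output experts.

    Indices below n_real are real experts; the rest are parameter-free
    experts whose output is zero, so selecting them costs no expert FLOPs.
    Returns (real expert indices to run, number of skipped slots).
    """
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    top = order[:k]
    real = [i for i in top if i < n_real]
    return real, k - len(real)

# 4 real experts (indices 0-3) plus 2 zero experts (indices 4-5), top-2 routing.
token_logits = [
    [2.0, 1.0, 0.0, -1.0, 0.5, 0.2],   # both slots go to real experts
    [0.1, 0.0, -0.5, 0.2, 3.0, 1.5],   # both slots go to zero experts
    [1.0, -1.0, 0.0, 0.3, 2.0, 0.1],   # one real expert, one zero expert
]

routes = [route_with_zero_experts(l, n_real=4, k=2) for l in token_logits]
skipped_fraction = sum(s for _, s in routes) / (2 * len(token_logits))
print(routes, skipped_fraction)
```

In this toy batch half of the expert slots are skipped. How often a trained router may choose a zero expert without hurting accuracy is exactly what the distillation stage has to learn.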

Inference and serving considerations

MoE models impose a distinctive serving profile: high memory (all experts must be loaded) but low per-token compute. This creates specific challenges for batching and power management. PALS, a power-aware inference runtime integrated into vLLM, treats GPU power caps as a first-class scheduling parameter and achieves up to 26.3% energy efficiency improvement with 4–7× fewer QoS violations across both dense and MoE deployments. The training-free looped transformer technique — retrofitting recurrence onto frozen checkpoints by reapplying mid-stack blocks — has been validated across sparse MoE and MLA+MoE architectures, yielding consistent benchmark improvements at no training cost.
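The "high memory, low compute" profile can be estimated from the parameter counts quoted in this guide. The sketch below is a back-of-the-envelope calculation under stated assumptions: 2 bytes per weight (16-bit storage), about 2 FLOPs per active weight per token for a forward pass, and no accounting for KV cache, activations, or attention over long contexts.

```python
def weight_memory_gb(total_params, bytes_per_param=2):
    # Assumption: every expert is resident in memory at 16-bit precision.
    return total_params * bytes_per_param / 1e9

def forward_flops_per_token(active_params):
    # Rule of thumb: one multiply and one add per active weight.
    return 2 * active_params

# (total parameters, active parameters) as quoted in this guide.
models = {
    "DeepSeek-V3": (671e9, 37e9),
    "Mistral Small 4": (119e9, 6e9),
}

profile = {
    name: {
        "weights_gb": weight_memory_gb(total),
        "flops_per_token": forward_flops_per_token(active),
        "active_fraction": active / total,
    }
    for name, (total, active) in models.items()
}

for name, row in profile.items():
    print(name, row)
```

Under these assumptions DeepSeek-V3 needs on the order of 1.3 TB just for weights while doing the per-token arithmetic of a 37B dense model, which is why MoE serving tends to be constrained by memory capacity and bandwidth rather than raw compute.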

Beyond language models

MoE has migrated well beyond LLMs. SegMoE applies the architecture to diffusion models for image generation. HANDOFF uses multi-teacher KL distillation into a MoE student to unify three specialist controllers (whole-body motion tracking, locomotion, fall-recovery) for humanoid robots, enabling natural-language-driven task execution on physical hardware. ProtoAda and CRAM both use MoE structures (Mixture-of-LoRA-Experts) for multimodal continual instruction tuning, addressing the catastrophic forgetting problem in sequential fine-tuning. FAME applies a lightweight MoE router to log anomaly detection in production systems, achieving F1=98.16 on BGL with 76× annotation reduction.

Where it's heading

Recent work points to three concurrent frontiers. First, scale with efficiency: the race is no longer just about total parameters but about the ratio of capability to activated compute, with post-hoc dynamic sparsification (ZEDA) and better load balancing during training (Qwen global-batch) both tightening that ratio. Second, edge deployment: MobileMoE's on-device scaling laws suggest MoE could become the default architecture for capable on-device models, not just cloud inference. Third, enterprise training infrastructure: Mistral's Forge platform explicitly supports MoE pre-training and post-training for enterprise custom models, signaling that MoE is becoming a first-class option in the enterprise training stack, not just a research or hyperscaler concern.

[Figure: MoE layer, sparse routing through expert FFNs]

[Figure: MoE deployment spectrum (total → active parameters)]

Representative MoE models across the deployment spectrum

Model	Total params	Active params	Context	Notable
DeepSeek-V3	671B	37B	—	60 tok/s; $0.27/$1.10 per M tokens; open weights
Qwen3-Coder-480B	480B	35B	256K (1M via extrapolation)	SOTA open-weight agentic coding; comparable to Claude Sonnet 4
GLM-5.1	754B	40B	—	8-hour agentic coding sessions; MIT license
TML-Interaction-Small	276B	—	—	200ms micro-turns; audio/video/text; encoder-free early fusion
Mistral Small 4	119B	6B	256K	Apache 2.0; 40% latency reduction vs. predecessor
Qwen1.5-MoE-A2.7B	~14B	2.7B	—	Matches 7B dense models at 1/3 activated params
MobileMoE	1.3–5.3B	0.3–0.9B	—	On-device; 2–4× fewer FLOPs than dense baselines

Cells marked with a dash had no value reported in the source material.

FAQ

What is the core efficiency claim of MoE?

Total parameter count (and thus model capacity) scales independently of per-token compute: only a small fraction of experts activates per token, so you get a large model's knowledge at a small model's inference cost.

What is the main training challenge unique to MoE?

Expert load imbalance — the router tends to over-use a few experts and starve others, degrading both efficiency and quality. Techniques like Qwen's global-batch load balancing address this directly.

Can MoE models be made more efficient after training?

Yes — ZEDA shows that post-hoc self-distillation can eliminate over 50% of expert FLOPs from a static trained MoE with only marginal accuracy loss, without retraining from scratch.

Does MoE work at mobile/on-device scale?

MobileMoE demonstrates it does: models with 0.3–0.9B active parameters achieve 2–4× fewer inference FLOPs and 1.8–3.8× faster prefill than dense baselines on commodity smartphones.

Is MoE limited to language models?

No — SegMoE applies the architecture to diffusion models for image generation, and HANDOFF uses MoE distillation for humanoid robot whole-body control, showing the pattern generalizes well beyond LLMs.

More on Mixture of Experts (6)

Qwen Research

Global-batch Load Balancing for MoE LLM Training from Qwen

Qwen Research introduces a global-batch load balancing technique for Mixture-of-Experts (MoE) LLM training, claiming it is nearly a 'free lunch' improvement. The method addresses expert load imbalance across training batches, a known efficiency and quality bottleneck in MoE architectures. The approach targets the router and expert activation dynamics in transformer-based MoE layers.

Hugging Face Blog

Mixture of Experts Explained

This Hugging Face blog post provides a technical overview of the Mixture of Experts (MoE) architecture, explaining how sparse gating mechanisms route tokens to subsets of expert feed-forward layers to achieve computational efficiency. The post covers training dynamics, inference considerations, and the tradeoffs between dense and sparse models. It serves as a reference document contextualizing MoE's growing relevance following high-profile model releases using the architecture.

Hugging Face Blog

Mixture of Experts (MoEs) in Transformers

A Hugging Face blog post covering Mixture of Experts (MoE) architectures as applied to transformer models. The post likely explains the technical foundations, training considerations, and practical deployment aspects of MoE models. Given the timing in early 2026, it likely contextualizes recent MoE-based frontier models and tooling support within the Hugging Face ecosystem.

arXiv · cs.CL

ZEDA: Post-Trained MoE Models Can Skip Half Their Experts via Self-Distillation

This paper introduces Zero-Expert Self-Distillation Adaptation (ZEDA), a framework that converts static post-trained Mixture-of-Experts (MoE) language models into dynamic ones without pre-training from scratch. ZEDA injects parameter-free zero-output experts into each MoE layer and uses two-stage self-distillation with the original model as a frozen teacher. Applied to Qwen3-30B-A3B and GLM-4.7-Flash across 11 benchmarks, ZEDA eliminates over 50% of expert FLOPs with marginal accuracy loss and achieves approximately 1.20× end-to-end inference speedup, outperforming the strongest dynamic MoE baseline by 4–6 points.

arXiv · cs.LG

Complete-muE: Optimal Hyperparameter Transfer and Scaling for MoE Models

Complete-muE is a framework for transferring hyperparameters across dense FFN and Mixture-of-Experts (MoE) transformer architectures, addressing limitations of existing tools like μP and SDE that cannot handle simultaneous architecture and token-per-expert changes. It uses a two-bridge system: Bridge I maps dense FFN to Dense MoE via active-width μP with normalized router scale, and Bridge II maps Dense MoE to sparse MoE via activated-expert scaling with a first-order SDE correction. The practical outcome is a 'tune dense once, transfer to all' recipe that enables near-optimal hyperparameter reuse across MoE configurations without costly re-tuning. Experiments on language model and diffusion model pretraining confirm stable hyperparameter optima across architectures and parameter counts.

arXiv · cs.CL

MobileMoE: Scaling Mixture-of-Experts for Sub-Billion Parameter On-Device Deployment

MobileMoE introduces a family of on-device MoE language models with 0.3–0.9B active parameters and 1.3–5.3B total parameters, targeting mobile deployment under memory and compute constraints. The authors derive an on-device MoE scaling law identifying a sweet spot of moderate sparsity with fine-grained and shared experts, then train models through a four-stage recipe including quantization-aware training on open-source data. Across 14 benchmarks, MobileMoE matches or exceeds leading dense on-device LLMs with 2–4× fewer inference FLOPs, and delivers 1.8–3.8× faster prefill and 2.2–3.4× faster decode than dense baselines on commodity smartphones at comparable INT4 memory.

At a glance

Used in: DeepSeek-V3, Mistral Small 4, Qwen3-Coder, Gemini 3.5 Flash, GLM-5.1, TML-Interaction-Small, SegMoE (diffusion)
Category: Sparse neural network architecture
Key idea: Route each token to a small subset of expert FFN layers; total parameters scale independently of per-token compute
Maturity: Production-standard for frontier and open-weight LLMs; emerging for on-device and diffusion
Introduced: Popularized in open-weight LLMs from Dec 2023 (Mixtral)
Alternatives: Dense transformers, parameter-efficient fine-tuning (LoRA)