Concept guide · Beginner

Mixture of Experts: How AI Models Do More by Using Less

TL;DR: Mixture of Experts is a design trick that lets AI models grow very large on paper while staying fast and affordable to run, because only a small slice of the model activates for any given input. What started as a niche research idea has become the architecture of choice for many of today's most capable open-weights models, from tiny phone-sized assistants to frontier coding giants.

Key takeaways

DeepSeek-V3 has 671B total parameters but activates only 37B per token — running at 60 tokens/second and priced at $0.27/M input tokens.
Mistral's Mixtral (8 experts, 2 active per token) was an early landmark open-weights MoE model, released in December 2023.
MoE now spans the full size spectrum: from MobileMoE's sub-1B active-parameter models for smartphones to 480B+ parameter coding giants like Qwen3-Coder.
The technique has spread beyond language models — it's been applied to image-generation diffusion models (SegMoE) and humanoid robot controllers (HANDOFF).
Active research is tackling MoE-specific pain points: expert load imbalance during training, dynamic expert skipping at inference (ZEDA cuts over 50% of expert compute with marginal accuracy loss), and power-aware serving (PALS achieves up to 26.3% energy efficiency gains).

What is Mixture of Experts?

Imagine a hospital where, instead of one generalist doctor seeing every patient, a triage nurse quickly decides which specialist — cardiologist, neurologist, surgeon — is best suited for each case. Only one or two specialists do the actual work; the rest stay ready but idle. Mixture of Experts (MoE) works the same way inside an AI model.

A standard AI model (called a "dense" model) runs every piece of text through all of its internal components every time. MoE models instead contain many parallel sub-networks called experts, plus a lightweight router that reads each incoming chunk of text (a "token") and decides which few experts should handle it (two of eight in Mixtral, for example). The rest sit unused for that token. The result is a model that can be enormous in total size, storing vast, specialised knowledge, while activating only a small fraction of that size for any given input.

Why should you care?

The practical payoff is speed and cost. DeepSeek-V3, for example, has 671 billion total parameters — a staggering number — but activates only 37 billion per token. It runs at 60 tokens per second and costs just $0.27 per million input tokens. A dense model of equivalent quality would be far slower and more expensive to serve.

This efficiency unlocks things that were previously impractical:

Frontier-quality models behind low-cost APIs: multiple labs now offer MoE-based models at prices that make large-scale use affordable.
Capable models on your phone: MobileMoE fits a MoE model into 0.3–0.9 billion active parameters (1.3–5.3 billion total), with 1.8–3.8× faster prefill and 2.2–3.4× faster decode than dense baselines on commodity smartphones.
One model, many skills: Mistral Small 4 packs reasoning, image understanding, and coding into a single 119B-parameter MoE (only 6B active per token), replacing three separate specialist models.
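
To make the DeepSeek-V3 figures quoted above concrete, here is a small arithmetic sketch in Python. It uses only the two numbers given in this guide ($0.27 per million input tokens and 60 tokens per second); the function names and the example workload sizes are illustrative assumptions, and output-token pricing is not covered because the guide does not state it.

```python
# Figures quoted in this guide for DeepSeek-V3.
PRICE_PER_MILLION_INPUT_TOKENS_USD = 0.27
TOKENS_PER_SECOND = 60


def input_cost_usd(num_input_tokens: int) -> float:
    """Cost of sending num_input_tokens at the quoted input price."""
    return num_input_tokens / 1_000_000 * PRICE_PER_MILLION_INPUT_TOKENS_USD


def generation_seconds(num_output_tokens: int) -> float:
    """Time to produce num_output_tokens at the quoted throughput."""
    return num_output_tokens / TOKENS_PER_SECOND


if __name__ == "__main__":
    # Hypothetical workload: 50 million input tokens, one 3,000-token answer.
    print(f"input cost: ${input_cost_usd(50_000_000):.2f}")
    print(f"generation time: {generation_seconds(3_000):.0f} s")
```

Under these assumptions, 50 million input tokens cost $13.50, and a 3,000-token answer takes about 50 seconds to generate.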

How the router works (simply)

When a token arrives, the router scores every expert and picks the highest-scoring few (the top two in Mixtral). Those experts process the token, and their outputs are blended together, weighted by the router's scores. The router is trained alongside the experts, so over time it learns which experts are good at which kinds of content. Research from AllenAI (EMO) explores how this can lead to genuine emergent specialisation, where experts naturally gravitate toward different domains without anyone explicitly programming that division of labour.
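
The routing step described above can be sketched in a few lines of plain Python. This is a toy illustration of one common variant (top-k routing with the selected weights renormalised to sum to 1), not the implementation of any particular model. The names `route` and `moe_layer` are hypothetical, and the router scores are passed in directly, whereas a real router computes them from the token with a small learned layer.

```python
import math


def softmax(scores):
    """Turn raw router scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route(router_scores, k=2):
    """Return (expert_index, weight) pairs for the k highest-scoring experts.

    Weights are renormalised so the selected experts' weights sum to 1.
    """
    probs = softmax(router_scores)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


def moe_layer(token, experts, router_scores, k=2):
    """Blend the outputs of the selected experts; the others are never called."""
    out = [0.0] * len(token)
    for idx, weight in route(router_scores, k):
        expert_out = experts[idx](token)
        out = [o + weight * e for o, e in zip(out, expert_out)]
    return out


if __name__ == "__main__":
    # Four toy "experts": expert i just multiplies the token by (i + 1).
    toy_experts = [lambda tok, s=i + 1: [s * x for x in tok] for i in range(4)]
    scores = [0.0, 2.0, -1.0, 1.0]  # assumed router scores for one token
    print(route(scores))            # experts 1 and 3 are selected
    print(moe_layer([1.0, 2.0], toy_experts, scores))
```

Only two of the four toy experts run for this token, which is the source of the compute saving: the cost per token depends on k, not on the total number of experts.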

The landscape today

MoE has gone from a research curiosity to the dominant architecture for large open-weights models in just a few years. The December 2023 release of Mixtral — which used 8 experts with 2 active per token and matched much larger dense models — was a turning point that sparked a wave of MoE releases across Alibaba's Qwen family, DeepSeek, Google's Gemini line, and others.

The technique has also spread well beyond language. SegMoE applies it to image-generation diffusion models. The HANDOFF robotics paper uses a MoE student trained from multiple specialist teachers to control a humanoid robot's whole body. Thinking Machines Lab's TML-Interaction-Small is a 276B-parameter MoE that processes audio, video, and text simultaneously in near-real-time.

What researchers are still working on

MoE introduces its own engineering headaches, and the field is actively addressing them:

Load balancing: If the router always sends tokens to the same popular experts, others go underused and training becomes inefficient. Qwen Research published a "global-batch load balancing" technique they describe as nearly a free improvement.
Dynamic expert skipping: ZEDA showed that a post-trained MoE model can skip over 50% of its expert computations with only marginal accuracy loss, giving roughly a 1.2× inference speedup without retraining from scratch.
Power and energy: PALS, a power-aware serving runtime for vLLM, treats GPU power caps as a scheduling variable alongside batch size, achieving up to 26.3% energy efficiency gains for MoE deployments.
Hyperparameter transfer: Complete-muE provides a framework for tuning a dense model once and reliably transferring those settings to a MoE version — cutting the expensive trial-and-error of finding the right training configuration.
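
To illustrate the load-balancing problem from the list above, the sketch below computes a widely used auxiliary balance loss in the style of the Switch Transformer (number of experts times the sum, over experts, of the fraction of tokens routed to that expert multiplied by its mean router probability). This is a stand-in chosen for simplicity and is not Qwen's global-batch method, whose details are not given here. The function name and the example probabilities are assumptions, and top-1 routing is assumed.

```python
def load_balance_loss(router_probs):
    """Auxiliary balance loss for top-1 routing.

    router_probs: one list per token, each holding one probability per expert.
    The loss is about 1.0 when tokens are spread evenly and grows as routing
    collapses onto a few experts.
    """
    num_tokens = len(router_probs)
    num_experts = len(router_probs[0])

    # f[i]: fraction of tokens whose highest-probability expert is i.
    counts = [0] * num_experts
    for probs in router_probs:
        best = max(range(num_experts), key=lambda i: probs[i])
        counts[best] += 1
    f = [c / num_tokens for c in counts]

    # p[i]: mean router probability assigned to expert i across tokens.
    p = [sum(probs[i] for probs in router_probs) / num_tokens
         for i in range(num_experts)]

    return num_experts * sum(fi * pi for fi, pi in zip(f, p))


# Each token prefers a different expert: evenly balanced.
BALANCED = [
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.7, 0.1, 0.1],
    [0.1, 0.1, 0.7, 0.1],
    [0.1, 0.1, 0.1, 0.7],
]
# Every token prefers expert 0: routing has collapsed.
COLLAPSED = [[0.7, 0.1, 0.1, 0.1] for _ in range(4)]

if __name__ == "__main__":
    print(round(load_balance_loss(BALANCED), 3))   # about 1.0
    print(round(load_balance_loss(COLLAPSED), 3))  # about 2.8
```

Adding a small multiple of this value to the training loss nudges the router toward spreading tokens more evenly. Whether the statistics are gathered per micro-batch or across a larger batch is a design choice this sketch does not address.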

The bottom line

MoE is the answer to a real tension in AI: bigger models are more capable, but bigger models are more expensive to run. By activating only the relevant parts of a model for each input, MoE lets developers build very large, knowledgeable systems while keeping inference fast and affordable. It's now the standard approach for anyone building at the frontier — and increasingly, for anyone building for the edge.

[Figure: How a MoE layer routes a token]

MoE in practice: a sample of models

Model	Total params	Active params per token	Notable
DeepSeek-V3	671B	37B	Open weights; $0.27/M input tokens; 60 tok/s
Qwen3-Coder	480B	35B	256K–1M context; agentic coding
GLM-5.1	754B	40B	Up to 8-hour agentic runs; MIT license
Mistral Small 4	119B	6B	Multimodal + reasoning + coding; Apache 2.0
TML-Interaction-Small	276B	—	Audio/video/text; 200ms micro-turns
MobileMoE	1.3–5.3B	0.3–0.9B	Smartphone deployment; 2–4× fewer FLOPs vs dense
Qwen1.5-MoE-A2.7B	~14B	2.7B	Matches 7B dense models at 1/3 the compute

All figures are as reported in the sources this guide draws on; a dash in a cell marks a value that was not reported.
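
As a quick way to compare the rows above, the following sketch computes what share of each model's parameters is active per token. The numbers are copied from the table (in billions); rows with ranges or unreported values are left out, and the Qwen1.5-MoE-A2.7B total is the table's approximate figure. The names `MODELS` and `active_fraction` are illustrative.

```python
# (total parameters, active parameters per token), in billions, from the table.
MODELS = {
    "DeepSeek-V3": (671, 37),
    "Qwen3-Coder": (480, 35),
    "GLM-5.1": (754, 40),
    "Mistral Small 4": (119, 6),
    "Qwen1.5-MoE-A2.7B": (14, 2.7),  # total is approximate in the table
}


def active_fraction(total_b: float, active_b: float) -> float:
    """Share of the model's parameters used for a single token."""
    return active_b / total_b


if __name__ == "__main__":
    for name, (total, active) in MODELS.items():
        print(f"{name:<20} {active_fraction(total, active):6.1%}")
```

On these figures the four largest models activate roughly 5 to 7 percent of their parameters per token, while the much smaller Qwen1.5-MoE-A2.7B activates about 19 percent.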

FAQ

If a model has 671 billion parameters, does my computer have to process all of them?

No — that's the whole point of MoE. For any given piece of text, only a small fraction of those parameters (e.g., 37B in DeepSeek-V3) actually switch on. The rest sit idle, keeping speed and cost manageable.

Is a MoE model better than a dense model of the same active-parameter count?

Often yes. A MoE model has more total parameters to learn with during training, so it can develop more specialised knowledge, while still running at roughly the compute cost of the smaller dense model.

Can MoE run on a phone or laptop?

Yes. MobileMoE was designed specifically for smartphones: it needs 2–4× fewer inference FLOPs than comparable dense models and delivers 1.8–3.8× faster prefill and 2.2–3.4× faster decode on commodity smartphones.

Do all the experts learn different things?

Research like AllenAI's EMO suggests experts can develop genuine specialisations during training, but this 'emergent modularity' is still an active area of study rather than a guaranteed outcome.

What's the main downside of MoE?

Serving many experts efficiently is complex — you need to keep all expert weights in memory even though only a few activate at once, which raises memory requirements and creates load-balancing challenges during training.
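
The memory point above can be made concrete with a rough estimate. The sketch below multiplies parameter count by bytes per parameter, using the DeepSeek-V3 figures from this guide. The precisions (2 bytes and 1 byte per parameter) are assumptions for illustration, and the estimate covers weights only, ignoring caches, activations, and runtime overhead.

```python
def weight_memory_gb(params_billions: float, bytes_per_param: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes).

    One billion parameters at one byte each is 1 GB, so the result is
    simply the parameter count in billions times bytes per parameter.
    """
    return params_billions * bytes_per_param


if __name__ == "__main__":
    total_b, active_b = 671, 37  # DeepSeek-V3 figures quoted in this guide
    for label, nbytes in [("16-bit", 2), ("8-bit", 1)]:
        stored = weight_memory_gb(total_b, nbytes)
        used = weight_memory_gb(active_b, nbytes)
        print(f"{label}: {stored:.0f} GB stored, {used:.0f} GB used per token")
```

At an assumed 16 bits per parameter, that is about 1,342 GB of weights held in memory to use about 74 GB of them for any one token, which is why MoE saves compute but not memory.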

More on Mixture of Experts (6)

Qwen Research · 1mo ago

Global-batch Load Balancing for MoE LLM Training from Qwen

Qwen Research introduces a global-batch load balancing technique for Mixture-of-Experts (MoE) LLM training, claiming it is nearly a 'free lunch' improvement. The method addresses expert load imbalance across training batches, a known efficiency and quality bottleneck in MoE architectures. The approach targets the router and expert activation dynamics in transformer-based MoE layers.

Hugging Face Blog · 1mo ago

Mixture of Experts Explained

This Hugging Face blog post provides a technical overview of the Mixture of Experts (MoE) architecture, explaining how sparse gating mechanisms route tokens to subsets of expert feed-forward layers to achieve computational efficiency. The post covers training dynamics, inference considerations, and the tradeoffs between dense and sparse models. It serves as a reference document contextualizing MoE's growing relevance following high-profile model releases using the architecture.

Hugging Face Blog · 1mo ago

Mixture of Experts (MoEs) in Transformers

A Hugging Face blog post covering Mixture of Experts (MoE) architectures as applied to transformer models. The post likely explains the technical foundations, training considerations, and practical deployment aspects of MoE models. Given the timing in early 2026, it likely contextualizes recent MoE-based frontier models and tooling support within the Hugging Face ecosystem.

arXiv · cs.CL · 1mo ago

ZEDA: Post-Trained MoE Models Can Skip Half Their Experts via Self-Distillation

This paper introduces Zero-Expert Self-Distillation Adaptation (ZEDA), a framework that converts static post-trained Mixture-of-Experts (MoE) language models into dynamic ones without pre-training from scratch. ZEDA injects parameter-free zero-output experts into each MoE layer and uses two-stage self-distillation with the original model as a frozen teacher. Applied to Qwen3-30B-A3B and GLM-4.7-Flash across 11 benchmarks, ZEDA eliminates over 50% of expert FLOPs with marginal accuracy loss and achieves approximately 1.20× end-to-end inference speedup, outperforming the strongest dynamic MoE baseline by 4–6 points.

arXiv · cs.LG · 26d ago

Complete-muE: Optimal Hyperparameter Transfer and Scaling for MoE Models

Complete-muE is a framework for transferring hyperparameters across dense FFN and Mixture-of-Experts (MoE) transformer architectures, addressing limitations of existing tools like μP and SDE that cannot handle simultaneous architecture and token-per-expert changes. It uses a two-bridge system: Bridge I maps dense FFN to Dense MoE via active-width μP with normalized router scale, and Bridge II maps Dense MoE to sparse MoE via activated-expert scaling with a first-order SDE correction. The practical outcome is a 'tune dense once, transfer to all' recipe that enables near-optimal hyperparameter reuse across MoE configurations without costly re-tuning. Experiments on language model and diffusion model pretraining confirm stable hyperparameter optima across architectures and parameter counts.

arXiv · cs.CL · 24d ago

MobileMoE: Scaling Mixture-of-Experts for Sub-Billion Parameter On-Device Deployment

MobileMoE introduces a family of on-device MoE language models with 0.3–0.9B active parameters and 1.3–5.3B total parameters, targeting mobile deployment under memory and compute constraints. The authors derive an on-device MoE scaling law identifying a sweet spot of moderate sparsity with fine-grained and shared experts, then train models through a four-stage recipe including quantization-aware training on open-source data. Across 14 benchmarks, MobileMoE matches or exceeds leading dense on-device LLMs with 2–4× fewer inference FLOPs, and delivers 1.8–3.8× faster prefill and 2.2–3.4× faster decode than dense baselines on commodity smartphones at comparable INT4 memory.
