What is Mixture of Experts?
Imagine a hospital where, instead of one generalist doctor seeing every patient, a triage nurse quickly decides which specialist — cardiologist, neurologist, surgeon — is best suited for each case. Only one or two specialists do the actual work; the rest stay ready but idle. Mixture of Experts (MoE) works the same way inside an AI model.
A standard AI model (called a "dense" model) runs every piece of text through all of its internal components every time. MoE models instead contain many parallel sub-networks called experts, plus a lightweight router that reads each incoming chunk of text (a "token") and decides which few experts should handle it, typically a small fixed number such as two out of eight. The rest sit unused for that token. The result is a model that can be enormous in total size, storing vast, specialised knowledge, while activating only a small fraction of that size for any given input.
Why should you care?
The practical payoff is speed and cost. DeepSeek-V3, for example, has 671 billion total parameters but activates only 37 billion per token. At launch, DeepSeek reported generation speeds of about 60 tokens per second and priced its API at $0.27 per million input tokens. A dense model of equivalent quality would typically be far slower and more expensive to serve, because every one of its parameters takes part in processing every token.
This efficiency unlocks things that were previously impractical:
- Frontier-quality models through low-cost APIs: multiple labs now offer MoE-based models at prices that make large-scale use affordable.
- Capable models on your phone: MobileMoE fits a MoE model into 0.3–0.9 billion active parameters, running 1.8–3.8× faster than comparable dense models on ordinary smartphones.
- One model, many skills: Mistral Small 4 packs reasoning, image understanding, and coding into a single 119B-parameter MoE (only 6B active per token), replacing three separate specialist models.
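To make "a small fraction" concrete, the short Python sketch below works out the share of parameters that is active per token, using only the total and active counts quoted above. The function name `active_fraction` and the `MODELS` table are illustrative, not taken from any library. Treating per-token compute as roughly proportional to active parameters is a simplifying assumption that ignores memory and communication costs.

```python
# Parameter counts quoted in the text above, in billions.
MODELS = {
    "DeepSeek-V3": {"total_b": 671, "active_b": 37},
    "Mistral Small 4": {"total_b": 119, "active_b": 6},
}


def active_fraction(total_b: float, active_b: float) -> float:
    """Share of the model's parameters that take part in processing one token."""
    return active_b / total_b


for name, params in MODELS.items():
    frac = active_fraction(params["total_b"], params["active_b"])
    print(f"{name}: {frac:.1%} of parameters active per token")
# DeepSeek-V3: 5.5% of parameters active per token
# Mistral Small 4: 5.0% of parameters active per token
```

In both cases roughly one parameter in twenty does any work for a given token, which is where the speed and cost savings come from.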
How the router works (simply)
When text arrives, the router scores each expert for every token and picks the highest-scoring few (often two). Those experts process the token, and their outputs are blended together, weighted by the router's scores. The router is trained alongside the experts, so over time it learns which experts are good at which kinds of content. Research from AllenAI (EMO) explores how this can lead to genuine emergent specialisation, with experts naturally gravitating toward different domains, without anyone explicitly programming that division of labour.
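The score, pick, and blend steps can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not any particular model's implementation: the experts are plain functions on a single number, the router scores are supplied by hand rather than computed by a learned layer, and the blend weights come from a softmax over the chosen experts only, which is one common choice among several. The names `route` and `moe_layer` are hypothetical.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route(router_scores, k=2):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    ranked = sorted(range(len(router_scores)),
                    key=lambda i: router_scores[i], reverse=True)
    top = ranked[:k]
    weights = softmax([router_scores[i] for i in top])
    return list(zip(top, weights))


def moe_layer(x, router_scores, experts, k=2):
    """Blend the outputs of the chosen experts; the others are never called."""
    return sum(w * experts[i](x) for i, w in route(router_scores, k))


# Toy experts: in a real model each would be a neural sub-network.
experts = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3, lambda x: x * x]

# Hand-written router scores for one token (an assumption for illustration).
scores = [0.1, 2.0, -1.0, 2.0]

print(route(scores))                    # [(1, 0.5), (3, 0.5)]
print(moe_layer(3.0, scores, experts))  # 7.5
```

Experts 1 and 3 tie for the highest score, so each gets half the weight and the output is 0.5 × 6 + 0.5 × 9 = 7.5. Experts 0 and 2 are never evaluated, which is the source of the compute savings.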
The landscape today
MoE has gone from a research curiosity to the dominant architecture for large open-weights models in just a few years. The December 2023 release of Mixtral — which used 8 experts with 2 active per token and matched much larger dense models — was a turning point that sparked a wave of MoE releases across Alibaba's Qwen family, DeepSeek, Google's Gemini line, and others.
The technique has also spread well beyond language. SegMoE applies it to image-generation diffusion models. The HANDOFF robotics paper uses a MoE student trained from multiple specialist teachers to control a humanoid robot's whole body. Thinking Machines Lab's TML-Interaction-Small is a 276B-parameter MoE that processes audio, video, and text simultaneously in near-real-time.
What researchers are still working on
MoE introduces its own engineering headaches, and the field is actively addressing them:
- Load balancing: If the router always sends tokens to the same popular experts, others go underused and training becomes inefficient. Qwen Research published a "global-batch load balancing" technique they describe as nearly a free improvement.
- Dynamic expert skipping: ZEDA showed that a post-trained MoE model can skip more than 50% of its expert computations with only marginal accuracy loss, giving roughly a 1.2× inference speedup without retraining from scratch.
- Power and energy: PALS, a power-aware serving runtime for vLLM, treats GPU power caps as a scheduling variable alongside batch size, achieving up to 26.3% energy efficiency gains for MoE deployments.
- Hyperparameter transfer: Complete-muE provides a framework for tuning a dense model once and reliably transferring those settings to a MoE version — cutting the expensive trial-and-error of finding the right training configuration.
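The load-balancing problem in the first bullet can be made measurable. The sketch below computes a widely used auxiliary penalty in the style of the Switch Transformer: the number of experts times the sum, over experts, of the fraction of tokens routed to each expert multiplied by the router's average probability for that expert. It is not the Qwen global-batch technique mentioned above, whose details are not given here. The function name `load_balance_loss` and the example probabilities are made up for illustration.

```python
def load_balance_loss(router_probs, k=1):
    """Switch-Transformer-style balance penalty.

    router_probs: one list per token, holding that token's router
    probabilities over all experts. The value is 1.0 when routing is
    perfectly even and grows as routing collapses onto a few experts.
    """
    n_tokens = len(router_probs)
    n_experts = len(router_probs[0])

    # Count how many tokens each expert receives under top-k routing.
    counts = [0] * n_experts
    for probs in router_probs:
        ranked = sorted(range(n_experts), key=lambda i: probs[i], reverse=True)
        for i in ranked[:k]:
            counts[i] += 1

    # f[i]: fraction of routed slots that went to expert i.
    f = [c / (n_tokens * k) for c in counts]
    # p[i]: average router probability assigned to expert i.
    p = [sum(probs[i] for probs in router_probs) / n_tokens
         for i in range(n_experts)]

    return n_experts * sum(fi * pi for fi, pi in zip(f, p))


# Four tokens, four experts: each token prefers a different expert.
balanced = [
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.7, 0.1, 0.1],
    [0.1, 0.1, 0.7, 0.1],
    [0.1, 0.1, 0.1, 0.7],
]
# Every token prefers expert 0: the "popular expert" failure mode.
collapsed = [[0.7, 0.1, 0.1, 0.1]] * 4

print(round(load_balance_loss(balanced), 3))   # 1.0
print(round(load_balance_loss(collapsed), 3))  # 2.8
```

During training, a penalty like this is typically added to the main loss with a small weight, nudging the router to spread tokens more evenly without overriding what it has learned about which expert suits which content.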
The bottom line
MoE is the answer to a real tension in AI: bigger models are more capable, but bigger models are more expensive to run. By activating only the relevant parts of a model for each input, MoE lets developers build very large, knowledgeable systems while keeping inference fast and affordable. It's now the standard approach for anyone building at the frontier — and increasingly, for anyone building for the edge.




