What it is
Knowledge distillation is a way of making AI models smaller and faster without losing too much of what makes them smart. The idea: take a large, expensive "teacher" model that performs well, and use its outputs to train a much smaller "student" model. The student learns not just the right answers, but the teacher's confidence patterns — subtle signals that carry more information than a simple label ever could.
Think of it like an apprentice watching a master chef. The apprentice doesn't just memorize recipes; they pick up on the chef's instincts — when to add a pinch more salt, when to trust the smell. The result is a cook who performs far above what raw recipe-following would produce.
Why should you care?
Running a frontier AI model is expensive. The biggest models require powerful server hardware and cost real money per query. Distillation is how AI gets out of the data center and into your app, your phone, or your hospital's CPU. A recent study on healthcare data showed distilled student models retaining at least 90% of the teacher's accuracy while running 26 times faster on a standard CPU — that's the difference between a tool you can afford to deploy and one you can't.
Hugging Face demonstrated this concretely by open-sourcing distilled versions of Stable Diffusion (called SD-Small and SD-Tiny) — smaller, faster image-generation models that anyone can run or build on. Capital Fund Management used a similar pattern in finance: they had a large language model generate training signals, then used those to fine-tune a compact model for production, cutting cost and latency.
How it works (the basics)
The process has three steps:
1. Run the teacher. Feed your training data through the large model and collect its outputs: not just "the answer," but the full probability distribution it assigns across possible answers.
2. Train the student. Train a smaller model to match those probability distributions, not just the final labels. This is the key insight: the teacher's soft outputs are richer than hard labels.
3. Deploy the student. The student is the one that goes into production: small, fast, cheap.
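To make the first two steps concrete, here is a minimal sketch in plain Python of what "matching the teacher's probability distribution" can look like for a single training example. It assumes a three-class problem and uses a "temperature" setting to soften the teacher's outputs, which is a common choice but not something the steps above require. The logit values, the function names, and the temperature of 2.0 are all made up for illustration.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; a higher temperature flattens them."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q): how far the student's distribution q is from the teacher's p."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """Step 2 for one example: compare softened teacher and student outputs."""
    teacher_probs = softmax(teacher_logits, temperature)
    student_probs = softmax(student_logits, temperature)
    return kl_divergence(teacher_probs, student_probs)

# Hypothetical raw scores for three classes, e.g. (cat, dog, truck).
teacher_logits = [4.0, 2.5, -1.0]   # step 1: output collected from the teacher
student_logits = [1.0, 1.2, 0.8]    # an untrained student's guess

# The soft targets keep the hint that "dog" is a far more plausible
# mistake than "truck", which a hard label of "cat" would throw away.
soft_targets = softmax(teacher_logits, temperature=2.0)
loss = distillation_loss(teacher_logits, student_logits)

print("soft targets:", [round(p, 3) for p in soft_targets])
print("distillation loss:", round(loss, 4))
```

In real training this loss would be averaged over many examples and minimized with a deep learning framework. The sketch only shows the quantity the student is asked to drive toward zero.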
A 2026 research paper added a useful wrinkle: you don't always need a stronger teacher to get a better student. Even a small, undertrained teacher can improve a larger student when the distillation and standard training losses are mixed in the right proportions. Counterintuitively, a very strong teacher can sometimes hurt the student by overwhelming it with signals it can't yet absorb.
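One common way to mix the two losses is a weighted sum, sketched below in plain Python. This is an illustrative assumption, not necessarily the formulation the paper uses: the weight alpha, the temperature, the function names, and the example logits are all hypothetical.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; a higher temperature flattens them."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, label_index):
    """Standard training loss: penalize low probability on the true label."""
    return -math.log(probs[label_index])

def kl_divergence(p, q):
    """Distillation loss: distance from the teacher's distribution p to q."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def mixed_loss(student_logits, teacher_logits, label_index,
               alpha=0.5, temperature=2.0):
    """Blend the two signals. alpha=0 ignores the teacher, alpha=1 ignores the label."""
    hard = cross_entropy(softmax(student_logits), label_index)
    soft = kl_divergence(softmax(teacher_logits, temperature),
                         softmax(student_logits, temperature))
    return alpha * soft + (1 - alpha) * hard

# Hypothetical values: the true class is index 0.
teacher_logits = [4.0, 2.5, -1.0]
student_logits = [1.0, 1.2, 0.8]

for alpha in (0.0, 0.5, 1.0):
    value = mixed_loss(student_logits, teacher_logits, 0, alpha=alpha)
    print(f"alpha={alpha}: loss={value:.4f}")
```

Lowering alpha is one way to keep a mismatched teacher from dominating training, since the student then leans more on the ground-truth labels.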
The controversy: distillation as a competitive weapon
Distillation has a darker side. Because it works by querying a model and learning from its outputs, it can be used to copy a model's capabilities without access to its weights or training data — just its API.
Anthropic publicly accused three Chinese AI labs — DeepSeek, Moonshot AI, and MiniMax — of doing exactly this at industrial scale: generating over 16 million exchanges through roughly 24,000 fraudulent accounts to extract Claude's capabilities in areas like agentic reasoning, coding, and chain-of-thought generation. Anthropic framed this not just as a terms-of-service violation but as a national security concern, arguing that illicitly distilled models also strip out the safety guardrails built into the original.
Commentary from the AI research community has pushed back on how much of Chinese LLM progress can actually be attributed to distillation attacks versus independent research — the picture is genuinely contested. But the episode illustrates that distillation, as a technique, sits at the intersection of efficiency engineering and competitive intelligence.
Where it's heading
Distillation is already standard practice for getting AI into production. The open questions are about its limits and its governance. On the technical side, researchers are still mapping when distillation helps versus hurts, and how to handle specialized domains like healthcare without leaking sensitive data. On the policy side, the "distillation attack" framing signals that how you use distillation — and whose model you distill from — is becoming a legal and geopolitical question, not just an engineering one.




