What it is
Knowledge distillation is a model compression technique in which a smaller "student" model is trained to reproduce the behavior of a larger, more capable "teacher" model. Rather than learning from raw ground-truth labels alone, the student learns from signals the teacher produces (its soft predictions, logits, or intermediate representations), which carry richer information about what the teacher has learned. The result is a model that is structurally smaller and faster to run but retains much of the teacher's learned capability.
The technique is distinct from other compression approaches such as quantization and pruning, and from adapter methods such as LoRA, in that it trains a genuinely new, smaller model rather than modifying an existing one. This makes it the natural choice when the deployment target has hard constraints on size, latency, or hardware.
How it works
The core training loop adds a distillation loss term alongside the standard task loss. The student is penalized not just for wrong answers but for diverging from the teacher's probability distribution over outputs — a signal that encodes the teacher's uncertainty and its implicit knowledge of which wrong answers are "less wrong." The teacher's weights are frozen throughout; only the student trains.
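As a minimal sketch of the combined objective described above, the following standard-library Python computes a temperature-softened distillation term plus the usual task loss for a single example. The names `distillation_loss`, `alpha`, and `temperature` are illustrative assumptions, and scaling the soft term by the squared temperature is a common convention rather than something this article prescribes.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(student_logits, label):
    """Standard task loss against the hard ground-truth label."""
    return -math.log(softmax(student_logits)[label])


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def distillation_loss(student_logits, teacher_logits, label,
                      alpha=0.5, temperature=2.0):
    """Mix the soft (teacher-matching) term with the hard (label) term.

    alpha and temperature are hypothetical hyperparameter names; the
    teacher logits are treated as constants because the teacher is frozen.
    """
    teacher_probs = softmax(teacher_logits, temperature)
    student_probs = softmax(student_logits, temperature)
    # Assumed convention: scale by T^2 so the soft term keeps a comparable
    # gradient magnitude as the temperature changes.
    soft = kl_divergence(teacher_probs, student_probs) * temperature ** 2
    hard = cross_entropy(student_logits, label)
    return alpha * soft + (1.0 - alpha) * hard


if __name__ == "__main__":
    teacher = [2.0, 1.0, 0.1]
    student = [0.1, 1.0, 2.0]
    print(distillation_loss(student, teacher, label=0))
```

In a real training loop this quantity would be averaged over a batch and differentiated with respect to the student's parameters only.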
In practice, several variants have emerged:
- Output distillation: student matches the teacher's final logits or soft labels.
- Feature distillation: student matches intermediate layer activations, transferring representational structure.
- Data-free / synthetic distillation: teacher generates synthetic training data (e.g., via prompting or sampling) that the student then trains on — the pattern underlying so-called "distillation attacks" against proprietary LLMs.
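To make the feature-distillation variant above concrete, here is a small sketch in plain Python. It assumes the student's hidden vector is narrower than the teacher's and bridges the gap with a linear projection; the projection, the mean-squared-error objective, and the names `project` and `feature_distillation_loss` are illustrative choices, not a method this article specifies.

```python
def project(features, weights):
    """Map a student feature vector into the teacher's feature dimension.

    weights is a list of rows, one per teacher dimension (hypothetical
    learned projection; here it is simply passed in).
    """
    return [sum(w * x for w, x in zip(row, features)) for row in weights]


def feature_distillation_loss(student_features, teacher_features, weights):
    """Mean squared error between projected student and teacher activations."""
    projected = project(student_features, weights)
    if len(projected) != len(teacher_features):
        raise ValueError("projection must output the teacher's dimension")
    squared = [(p - t) ** 2 for p, t in zip(projected, teacher_features)]
    return sum(squared) / len(squared)


if __name__ == "__main__":
    student_hidden = [1.0, 2.0]                      # 2-dimensional student layer
    teacher_hidden = [1.0, 2.0, 5.0]                 # 3-dimensional teacher layer
    projection = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    print(feature_distillation_loss(student_hidden, teacher_hidden, projection))
```

In practice the projection would be trained jointly with the student, and the choice of which layers to pair is one of the hyperparameters that needs tuning.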
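A key implementation detail for in-context teacher models (such as tabular foundation models) is context leakage: if the teacher's context contains the same examples it is asked to label, its predictions on those examples are overconfident, and the student learns to mimic that overfitting. Stratified out-of-fold teacher labeling addresses this by having the teacher label each fold using only the other folds as context, and is now considered best practice in structured-data settings.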
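The out-of-fold idea can be sketched in a few lines of standard-library Python. Everything named here is an assumption for illustration: `teacher_predict` stands in for whatever in-context teacher is used, and `toy_teacher` is a deliberately leaky stand-in that returns a one-hot answer whenever the row it is labeling appears in its context.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_folds, seed=0):
    """Assign each row index to a fold, spreading every class across folds."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    fold_of = [None] * len(labels)
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, idx in enumerate(indices):
            fold_of[idx] = position % n_folds
    return fold_of


def out_of_fold_soft_labels(rows, labels, teacher_predict, n_folds=5, seed=0):
    """Label every row with a teacher whose context excludes that row's fold."""
    fold_of = stratified_folds(labels, n_folds, seed)
    soft_labels = [None] * len(rows)
    for fold in range(n_folds):
        context = [(rows[i], labels[i])
                   for i in range(len(rows)) if fold_of[i] != fold]
        for i in range(len(rows)):
            if fold_of[i] == fold:
                soft_labels[i] = teacher_predict(context, rows[i])
    return soft_labels


def toy_teacher(context, row):
    """Hypothetical teacher: memorizes seen rows, else returns class priors."""
    for seen_row, seen_label in context:
        if seen_row == row:
            return {seen_label: 1.0}  # leaked label, overconfident by construction
    counts = defaultdict(int)
    for _, y in context:
        counts[y] += 1
    total = sum(counts.values())
    return {y: c / total for y, c in counts.items()}


if __name__ == "__main__":
    rows = [(i,) for i in range(20)]
    labels = [i % 2 for i in range(20)]
    for row, soft in zip(rows, out_of_fold_soft_labels(rows, labels, toy_teacher)):
        print(row, soft)
```

Because no row is ever in the context used to label it, the toy teacher never gets the chance to return a memorized one-hot label, which is the property the student's training targets need.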
Why it matters
Distillation is the primary mechanism by which frontier-scale capability becomes economically deployable. A model that costs hundreds of dollars per million tokens to run at frontier scale can, after distillation, be served at a fraction of that cost on commodity hardware. Reported healthcare tabular benchmark results (90%+ AUC retention at a 26× CPU speedup) illustrate the practical tradeoff: near-frontier quality at inference costs that make production deployment viable.
For enterprises, the pattern is now well-established: use a large LLM to generate labels, reasoning traces, or synthetic examples for a task-specific dataset, then fine-tune a compact model on that data. Capital Fund Management's financial-domain case study is a representative example of this production pattern.
For the open-source ecosystem, Hugging Face's release of distillation code and weights for SD-Small and SD-Tiny established a reproducible reference for diffusion model compression, lowering the barrier for community experimentation.
Challenging the strong-teacher assumption
A persistent intuition in distillation is that a better teacher always produces a better student. Recent research directly challenges this. Systematic variation of architecture sizes and training token budgets shows that even small, undertrained teachers can improve larger student models — provided the distillation loss is properly mixed with the standard language modeling loss. Counterintuitively, stronger teachers can saturate or even reverse distillation gains, and distillation appears to benefit generalization more than in-domain fitting. This has practical implications: practitioners should not assume that the largest available teacher is the right choice, and loss mixing ratios deserve careful tuning.
The geopolitical dimension: distillation attacks
Distillation has acquired a second meaning in the policy and security literature: the use of large-scale, unauthorized API queries to harvest a proprietary model's outputs as training data for a competing model. Anthropic publicly attributed exactly this pattern to three Chinese AI laboratories (DeepSeek, Moonshot AI, and MiniMax), alleging over 16 million exchanges generated through approximately 24,000 fraudulent accounts in violation of its terms of service. More than 13 million of those exchanges were attributed to MiniMax alone. The targeted capabilities included agentic reasoning, tool use, coding, and chain-of-thought generation, the most differentiated and safety-relevant behaviors of the Claude model family.
Anthropic frames this not merely as a terms-of-service violation but as a national security concern: illicitly distilled models, it argues, lack the safeguards embedded in the original and undermine US export controls. Commentary from Interconnects (Nathan Lambert) situates the claim within the ongoing debate about how much of Chinese LLM progress is actually explained by distillation versus independent development, a question that the public evidence cited here does not resolve.
Privacy-preserving distillation
A related but distinct application is distillation as a privacy mechanism: training a student on a teacher's outputs rather than on raw private data limits direct exposure of sensitive records. OpenAI explored this pattern as early as 2016 in work on semi-supervised knowledge transfer from private training data. The approach remains relevant wherever data governance constraints make direct training on sensitive datasets impractical — healthcare being the canonical domain.
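One way such a scheme can label data for the student is noisy vote aggregation over an ensemble of teachers, each assumed to have been trained on a disjoint slice of the private data. The sketch below illustrates that idea under stated assumptions: the function names, the Laplace noise, and the `noise_scale` parameter are illustrative, and the code makes no claim about formal privacy guarantees.

```python
import random
from collections import Counter


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_vote_label(teacher_votes, num_classes, noise_scale, rng):
    """Return the class with the highest noise-perturbed vote count.

    teacher_votes is a list of class indices, one per teacher. Only the
    noisy winner is released to the student, never the raw records or
    the exact vote counts.
    """
    counts = Counter(teacher_votes)
    noisy = [counts.get(c, 0) + laplace_noise(noise_scale, rng)
             for c in range(num_classes)]
    return max(range(num_classes), key=lambda c: noisy[c])


if __name__ == "__main__":
    rng = random.Random(0)
    votes = [0, 0, 0, 1, 2, 0, 1]  # hypothetical votes from seven teachers
    print(noisy_vote_label(votes, num_classes=3, noise_scale=1.0, rng=rng))
```

Larger noise scales hide individual teachers' votes more thoroughly but make the released label less reliable, so the scale is a privacy-versus-accuracy knob.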
Tradeoffs and when not to use it
Distillation is not always the right tool. It requires a capable teacher to exist and be queryable, adds a training pipeline step, and introduces hyperparameters (loss mixing ratio, temperature, which layers to match) that require tuning. When the teacher is only marginally better than a model of the target student size, the gains may not justify the overhead, and as recent research shows, a very strong teacher is not guaranteed to help. When the goal is task adaptation rather than size reduction, LoRA or full fine-tuning on task data is simpler. When the goal is purely inference speedup without retraining, quantization or pruning may suffice.
The illicit-distillation context adds another consideration for model providers: large-scale synthetic data generation via API is now a recognized attack surface, and rate limiting, account verification, and behavioral anomaly detection are becoming standard defenses.
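As a hedged illustration of the rate-limiting defense mentioned above, the following is a per-account token bucket in plain Python. The class name, parameters, and the choice of a token bucket are assumptions for the sketch; real providers combine limits like this with account verification and anomaly detection that are not shown here.

```python
class TokenBucket:
    """Allow short bursts but cap the sustained request rate per account."""

    def __init__(self, capacity, refill_per_second, start_time=0.0):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.tokens = float(capacity)   # start with a full bucket
        self.last_seen = start_time

    def allow(self, now, cost=1.0):
        """Return True if a request at time `now` (seconds) may proceed."""
        elapsed = max(0.0, now - self.last_seen)
        self.tokens = min(self.capacity,
                          self.tokens + elapsed * self.refill_per_second)
        self.last_seen = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


if __name__ == "__main__":
    buckets = {}  # hypothetical in-memory store keyed by account id
    bucket = buckets.setdefault("account-123", TokenBucket(3, 1.0))
    for t in [0.0, 0.0, 0.0, 0.0, 2.0]:
        print(t, bucket.allow(now=t))
```

Timestamps are passed in explicitly so the behavior is deterministic; a deployed version would read a monotonic clock and persist bucket state outside process memory.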




