technique

Weak-to-Strong Distillation

technique · active · provisional · weak-to-strong-distillation-1114801d · 1 event · first seen 23d ago

Aliases: Weak-to-Strong Distillation

Co-occurring entities

knowledge distillation · Language Modeling Loss · LLM Pretraining

More like this (12)

Generalized Distillation · ensemble distillation · Rank-to-Distill · Self-Distillation · Model Distillation · distillation · weak-to-strong generalization · On-Policy Co-Distillation · on-policy distillation · distillation attacks · distilabel · On-Policy Distillation (OPD)

Recent events (1)

6 · arXiv · cs.LG · 23d ago

Strong Teacher Not Needed? On Distillation in LLM Pretraining

This paper challenges the conventional assumption that knowledge distillation requires a stronger teacher to produce better students. Through systematic variation of architecture sizes and training token budgets, the authors find that even small, undertrained teachers can improve larger student models when language modeling and distillation losses are properly mixed. Counterintuitively, stronger teachers can saturate or reverse distillation gains, and distillation benefits generalization more than in-domain fitting.
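The summary above does not say how the two losses are combined, so the following is only a minimal sketch of one common formulation: a convex combination of next-token cross-entropy and a KL divergence from the teacher's distribution to the student's, computed for a single token position. The function names, the `alpha` mixing weight, and the `temperature` parameter are illustrative assumptions, not the paper's notation or method.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def mixed_distillation_loss(student_logits, teacher_logits, target,
                            alpha=0.5, temperature=1.0):
    """Mix a language modeling loss with a distillation loss for one position.

    alpha (hypothetical name) weights the distillation term:
    alpha=0 is plain language modeling, alpha=1 is pure distillation.
    """
    # Language modeling loss: negative log-likelihood of the ground-truth token.
    lm_loss = -log_softmax(student_logits)[target]

    # Distillation loss: KL(teacher || student) on temperature-softened
    # distributions. This particular form is an assumption of the sketch.
    s_logp = log_softmax([x / temperature for x in student_logits])
    t_logp = log_softmax([x / temperature for x in teacher_logits])
    kd_loss = sum(math.exp(t) * (t - s) for t, s in zip(t_logp, s_logp))

    return (1.0 - alpha) * lm_loss + alpha * kd_loss


# Toy vocabulary of three tokens; the ground-truth next token has index 0.
student = [2.0, 0.5, -1.0]
teacher = [1.0, 1.0, 0.0]
loss = mixed_distillation_loss(student, teacher, target=0, alpha=0.3)
```

In this sketch, sweeping `alpha` between 0 and 1 is the knob that corresponds to "mixing" the two losses; which value works best for a given teacher and student is an empirical question the summary leaves open.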

Training Infrastructure · Frontier Model Releases · knowledge distillation · Language Modeling Loss · Weak-to-Strong Distillation