Almanac
Weak-to-Strong Distillation

technique · active · provisional · weak-to-strong-distillation-1114801d · 1 event · first seen 23d ago

Aliases: Weak-to-Strong Distillation

Recent events (1)

6 · arXiv · cs.LG · 23d ago

Strong Teacher Not Needed? On Distillation in LLM Pretraining

This paper challenges the conventional assumption that knowledge distillation requires a stronger teacher to produce better students. Through systematic variation of architecture sizes and training token budgets, the authors find that even small, undertrained teachers can improve larger student models when language modeling and distillation losses are properly mixed. Counterintuitively, stronger teachers can saturate or reverse distillation gains, and distillation benefits generalization more than in-domain fitting.
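As a rough illustration of what mixing the language modeling and distillation losses can look like for a single token position, here is a minimal sketch. It is not the paper's formulation: the function names, the mixing weight `alpha`, and the choice of forward KL divergence from teacher to student are all assumptions made for the example.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def mixed_loss(student_logits, teacher_logits, target, alpha=0.5):
    """Hypothetical mixed objective for one token position.

    (1 - alpha) * cross-entropy against the ground-truth token
    + alpha * KL(teacher || student) over the vocabulary.
    `alpha` is an assumed mixing weight, not a value from the paper.
    """
    s_logp = log_softmax(student_logits)
    t_logp = log_softmax(teacher_logits)
    # Language modeling term: negative log-likelihood of the true next token.
    lm = -s_logp[target]
    # Distillation term: forward KL from teacher to student distribution.
    kd = sum(math.exp(tl) * (tl - sl) for tl, sl in zip(t_logp, s_logp))
    return (1 - alpha) * lm + alpha * kd

# Example: a 3-token vocabulary where the true next token has index 1.
loss = mixed_loss([1.0, 2.0, 0.5], [0.8, 1.5, 0.9], target=1, alpha=0.3)
```

With `alpha=0` this reduces to ordinary next-token cross-entropy, and with `alpha=1` the student only matches the teacher's distribution; intermediate values interpolate between the two.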