technique
Language Modeling Loss
technique · active · provisional
language-modeling-loss-b64c3b90 · 1 event · first seen 22d ago
Aliases: Language Modeling Loss
Co-occurring entities
More like this (12)
Language Model Finetuning
Language Model Safety Monitor
unsupervised language modeling
Transformer Language Models
Language Models are Few-Shot Learners
Arithmetic Pedagogy for Language Models
Reinforcement Learning for Language Models
Reasoning Language Models
generative language modeling
AnyLanguageModel
Scaling Laws for Neural Language Models
Latent Context Language Models
Recent events (1)
Strong Teacher Not Needed? On Distillation in LLM Pretraining
This paper challenges the conventional assumption that knowledge distillation requires a stronger teacher to produce better students. Through systematic variation of architecture sizes and training token budgets, the authors find that even small, undertrained teachers can improve larger student models when language modeling and distillation losses are properly mixed. Counterintuitively, stronger teachers can saturate or reverse distillation gains, and distillation benefits generalization more than in-domain fitting.
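The summary above does not say how the language modeling and distillation losses are mixed. As a rough illustration only, the sketch below assumes a common convention: a convex combination of the next-token cross-entropy and the KL divergence from teacher to student, weighted by a hypothetical coefficient `alpha`. The function names, the direction of the KL term, and the absence of a softmax temperature are all assumptions, not details taken from the paper.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def lm_loss(student_logits, target):
    """Language modeling loss: negative log-likelihood of the true next token."""
    return -log_softmax(student_logits)[target]


def distill_loss(student_logits, teacher_logits):
    """Assumed distillation loss: KL(teacher || student) over the vocabulary."""
    log_p_s = log_softmax(student_logits)
    log_p_t = log_softmax(teacher_logits)
    return sum(math.exp(lt) * (lt - ls) for lt, ls in zip(log_p_t, log_p_s))


def mixed_loss(student_logits, teacher_logits, target, alpha=0.5):
    """Convex mix of the two losses; alpha is a hypothetical mixing weight."""
    return ((1.0 - alpha) * lm_loss(student_logits, target)
            + alpha * distill_loss(student_logits, teacher_logits))
```

Under this convention, `alpha=0` reduces to plain language modeling and `alpha=1` to pure distillation; intermediate values trade off fitting the data labels against matching the teacher's distribution for a single token position.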