Entity · technique

Language Modeling Loss

technique · active · language-modeling-loss-b64c3b90 · 1 event · first seen May 25, 2026

Aliases: Language Modeling Loss

Co-occurring entities

knowledge distillation · Weak-to-Strong Distillation · LLM Pretraining

More like this (12)

Language Model Finetuning · Tapered Language Models · Random Language Model · Language Model Safety Monitor · unsupervised language modeling · Knowledge-Less Language Models · Transformer Language Models · LanguageModel protocol · Language Models are Few-Shot Learners · Knowledgeless Language Models: Suppressing Parametric Recall for Evidence-Grounded Language Modeling · Arithmetic Pedagogy for Language Models · Reinforcement Learning for Language Models

Recent events (1)

6 · arXiv · cs.LG · May 25, 2026

Strong Teacher Not Needed? On Distillation in LLM Pretraining

This paper challenges the conventional assumption that knowledge distillation requires a stronger teacher to produce better students. Through systematic variation of architecture sizes and training token budgets, the authors find that even small, undertrained teachers can improve larger student models when language modeling and distillation losses are properly mixed. Counterintuitively, stronger teachers can saturate or reverse distillation gains, and distillation benefits generalization more than in-domain fitting.

Training Infrastructure · Frontier Model Releases · knowledge distillation · Language Modeling Loss · Weak-to-Strong Distillation
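
The summary above refers to mixing a language modeling loss with a distillation loss. As a rough illustration only, the sketch below uses one common formulation: next-token cross-entropy for the language modeling term, KL(teacher || student) for the distillation term, and a convex combination weighted by `alpha`. The function names, the `alpha` parameter, and the choice of KL direction are assumptions made for this example; the paper's exact mixing scheme is not stated here.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def lm_loss(student_logits, target_ids):
    """Language modeling loss: mean negative log-likelihood of the target tokens."""
    total = 0.0
    for logits, target in zip(student_logits, target_ids):
        total -= log_softmax(logits)[target]
    return total / len(target_ids)

def distill_loss(student_logits, teacher_logits):
    """Distillation loss (assumed form): mean KL(teacher || student) per position."""
    total = 0.0
    for s_logits, t_logits in zip(student_logits, teacher_logits):
        log_s = log_softmax(s_logits)
        log_t = log_softmax(t_logits)
        total += sum(math.exp(lt) * (lt - ls) for lt, ls in zip(log_t, log_s))
    return total / len(student_logits)

def mixed_loss(student_logits, teacher_logits, target_ids, alpha=0.5):
    """Hypothetical mix: (1 - alpha) * LM loss + alpha * distillation loss."""
    return ((1.0 - alpha) * lm_loss(student_logits, target_ids)
            + alpha * distill_loss(student_logits, teacher_logits))

# Toy usage: 2 positions, vocabulary of 4 tokens.
student = [[0.0, 0.0, 0.0, 0.0], [2.0, 0.5, -1.0, 0.0]]
teacher = [[1.0, 0.0, 0.0, 0.0], [2.0, 0.5, -1.0, 0.0]]
targets = [0, 0]
loss = mixed_loss(student, teacher, targets, alpha=0.3)
```

Setting `alpha=0` recovers plain language modeling and `alpha=1` trains on the teacher signal alone; values in between correspond to the mixed regime the summary describes.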