Entity · technique

LLM Pretraining

technique · active · llm-pretraining-4219fed5 · 1 event · first seen May 25, 2026

Aliases: LLM Pretraining

Co-occurring entities

knowledge distillation · Language Modeling Loss · Weak-to-Strong Distillation

More like this (12)

Dep-LLM · LLM Agent Classroom · SpeechLLM · StreamingLLM · LLM Wiki · MyMentorLLM · LLM-TPU · train-llm-from-scratch · EvalLLM · MOPD: Multi-Teacher On-Policy Distillation for Capability Integration in LLM Post-Training · LLM Gateway · LLM-as-a-Coach: Experiential Learning for Non-Verifiable Tasks

Recent events (1)

6 · arXiv · cs.LG · May 25, 2026

Strong Teacher Not Needed? On Distillation in LLM Pretraining

This paper challenges the conventional assumption that knowledge distillation requires a stronger teacher to produce better students. Through systematic variation of architecture sizes and training token budgets, the authors find that even small, undertrained teachers can improve larger student models when language modeling and distillation losses are properly mixed. Counterintuitively, stronger teachers can saturate or reverse distillation gains, and distillation benefits generalization more than in-domain fitting.
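
The summary does not give the paper's exact objective, so the following is only a minimal sketch of what mixing a language modeling loss with a distillation loss commonly looks like for a single token position. The mixing weight `alpha`, the `temperature` parameter, the forward KL direction (teacher to student), and the function names are all assumptions made for illustration, not details taken from the paper.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def mixed_loss(student_logits, teacher_logits, target, alpha=0.5, temperature=1.0):
    """Per-token loss: (1 - alpha) * LM loss + alpha * distillation loss.

    LM loss is cross-entropy against the ground-truth token `target`.
    Distillation loss is KL(teacher || student) over the vocabulary,
    with both distributions softened by `temperature` (an assumption).
    """
    # Language modeling term: negative log-likelihood of the true next token.
    student_logp = log_softmax(student_logits)
    lm_loss = -student_logp[target]

    # Distillation term: KL divergence from teacher to student.
    s_logp = log_softmax([x / temperature for x in student_logits])
    t_logp = log_softmax([x / temperature for x in teacher_logits])
    kd_loss = sum(math.exp(t) * (t - s) for t, s in zip(t_logp, s_logp))

    return (1.0 - alpha) * lm_loss + alpha * kd_loss


# Hypothetical logits over a 3-token vocabulary.
student = [2.0, 0.5, -1.0]
teacher = [1.0, 1.0, 0.0]
print(mixed_loss(student, teacher, target=0, alpha=0.5))
```

With `alpha=0.0` this reduces to plain language modeling, and with `alpha=1.0` the student is trained only to match the teacher. Intermediate values keep the ground-truth signal in play, which is one plausible reading of how a weak teacher could still help without dragging the student down to its level.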

Training Infrastructure · Frontier Model Releases · knowledge distillation · Language Modeling Loss · Weak-to-Strong Distillation