paper
Dense Supervision, Sparse Updates: On the Sparsity and Geometry of On-Policy Distillation
paper · active · provisional
dense-supervision-sparse-updates-on-the-sparsity-and-geometry-of-on-policy-distillation-a3825d6c · 1 event · first seen 5d ago
Aliases: Dense Supervision, Sparse Updates: On the Sparsity and Geometry of On-Policy Distillation
More like this (12)
on-policy distillation
On-Policy Distillation (OPD)
Learning from the Self-future: On-policy Self-distillation for dLLMs
On-Policy Co-Distillation
on-policy self-distillation
Unsupervised Continual Clustering via Forward-Backward Knowledge Distillation
Continual LLM Upcycling: A Predictor-Gated Bank-Wise Sparsity Training Recipe for Dense-to-Sparse LLMs
Canonical-Context On-Policy Distillation (CCOPD)
Rethinking Dataset Distillation for Classification: Do Distilled Sets Outperform Coresets?
In-context Vector Distillation
A Unifying Lens on Supervised Fine-Tuning Through Target Distribution Design
Be My Tutor: On-Policy Co-Distillation for Mutual LLM Improvement via Peer Feedback
Recent events (1)
Analysis of on-policy distillation reveals sparse, geometrically structured parameter updates
A new arXiv paper analyzes on-policy distillation (OPD), a post-training method that combines on-policy student trajectories with dense teacher supervision, across language and vision-language model pairs. The authors find that OPD updates are coordinate-sparse and spread across layers, weighted toward FFN parameters, and that training only the discovered sparse subnetwork recovers near-full performance. Geometrically, the updates are numerically full-rank but spectrally concentrated, and they fall disproportionately on near-zero weight coordinates. This suggests that OPD retains a distinct geometric signature rather than behaving like ordinary dense parameter rewriting.
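To make two of the terms in this summary concrete, the following is a minimal toy sketch, not the paper's measurement protocol. It shows one plausible way to quantify coordinate sparsity (the fraction of weights an update leaves unchanged) and how much of an update lands on near-zero weights. The function names, the change tolerance `tol`, the `quantile` cutoff, and the toy weight vectors are all assumptions made for illustration.

```python
def update_mask(w_before, w_after, tol=1e-6):
    """True where a coordinate moved by more than tol (tol is an assumed threshold)."""
    return [abs(a - b) > tol for b, a in zip(w_before, w_after)]


def update_sparsity(mask):
    """Fraction of coordinates the update left unchanged."""
    return 1.0 - sum(mask) / len(mask)


def near_zero_share(w_before, mask, quantile=0.3):
    """Share of updated coordinates whose pre-update magnitude lies in the
    smallest `quantile` of all magnitudes (the cutoff rule is an assumption)."""
    mags = sorted(abs(w) for w in w_before)
    cutoff = mags[max(0, int(quantile * len(mags)) - 1)]
    updated = [abs(w) for w, changed in zip(w_before, mask) if changed]
    if not updated:
        return 0.0
    return sum(1 for v in updated if v <= cutoff) / len(updated)


# Toy flattened weight vectors: only three small-magnitude coordinates change.
w_before = [0.0, 0.5, -0.01, 1.2, 0.02, -0.8, 0.001, 0.3, -1.5, 0.9]
w_after = [0.05, 0.5, 0.04, 1.2, 0.02, -0.8, -0.03, 0.3, -1.5, 0.9]

mask = update_mask(w_before, w_after)
sparsity = update_sparsity(mask)
share = near_zero_share(w_before, mask)
print(f"update sparsity: {sparsity:.2f}, share on near-zero weights: {share:.2f}")
```

In this toy case the same `mask` could also serve as the sparse subnetwork: restricting later updates to the coordinates where it is true is one simple reading of "training only the discovered sparse subnetwork", though the summary does not say how the authors construct theirs.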