Entity · paper

Dense Supervision, Sparse Updates: On the Sparsity and Geometry of On-Policy Distillation

paper · active · dense-supervision-sparse-updates-on-the-sparsity-and-geometry-of-on-policy-distillation-a3825d6c · 1 event · first seen Jun 12, 2026

Aliases: Dense Supervision, Sparse Updates: On the Sparsity and Geometry of On-Policy Distillation

Co-occurring entities

on-policy distillation · AdamW

More like this (12)

on-policy distillation
Weak-to-Strong Generalization via Direct On-Policy Distillation
On-Policy Distillation (OPD)
Rethinking Classifier-Free Guidance in On-Policy Diffusion Distillation
Rethinking Classifier-Free Guidance in On-Policy Diffusion Distillation
Routing-based On-Policy Distillation
Pass the Baton: Trajectory-Relayed On-Policy Distillation
Multi-Teacher On-Policy Distillation
Learning from the Self-future: On-policy Self-distillation for dLLMs
On-Policy Co-Distillation
on-policy self-distillation
The Physics of Multi-Turn Long-Horizon Planning: From Pre-training to Post-training via Single- and Multi-Teacher On-Policy Agentic Distillation

Recent events (1)

5 · arXiv · cs.LG · Jun 12, 2026

Analysis of on-policy distillation reveals sparse, geometrically structured parameter updates

A new arXiv paper analyzes on-policy distillation (OPD), a post-training method that combines on-policy student trajectories with dense teacher supervision, across language and vision-language model pairs. The authors find that OPD updates are coordinate-sparse and spread across layers, weighted toward feed-forward (FFN) blocks, and that training only the discovered sparse subnetwork recovers near-full performance. Geometrically, the updates are numerically full-rank but spectrally concentrated, and they fall disproportionately on near-zero weight coordinates. This suggests that OPD leaves a distinct geometric signature rather than behaving like ordinary dense parameter rewriting.
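The three properties named in the summary (coordinate sparsity, spectral concentration despite full numerical rank, and updates landing on near-zero weights) can each be probed on a single weight matrix before and after training. The following is a minimal NumPy sketch, not the paper's measurement code: the function names, the change threshold `tol`, the 90% energy level, and the 10% magnitude quantile are all illustrative assumptions.

```python
import numpy as np


def update_sparsity(w_before, w_after, tol=1e-6):
    """Fraction of coordinates whose absolute change is at most `tol` (assumed threshold)."""
    delta = w_after - w_before
    return float(np.mean(np.abs(delta) <= tol))


def spectral_summary(delta, energy=0.9):
    """Return (numerical rank, number of top singular values that hold
    `energy` of the squared spectrum). A large rank with a small count
    indicates a full-rank but spectrally concentrated update."""
    s = np.linalg.svd(delta, compute_uv=False)  # singular values, descending
    total = float(np.sum(s ** 2))
    if total == 0.0:
        return 0, 0  # no update at all
    rank = int(np.linalg.matrix_rank(delta))
    cumulative = np.cumsum(s ** 2) / total
    k = int(np.searchsorted(cumulative, energy)) + 1
    return rank, min(k, len(s))


def near_zero_share(w_before, w_after, quantile=0.1, tol=1e-6):
    """Among changed coordinates, the share whose original magnitude lies in
    the bottom `quantile` of |w_before|. A share well above `quantile`
    means updates fall disproportionately on near-zero weights."""
    changed = np.abs(w_after - w_before) > tol
    if not changed.any():
        return 0.0
    cutoff = np.quantile(np.abs(w_before), quantile)
    return float(np.mean(np.abs(w_before)[changed] <= cutoff))


if __name__ == "__main__":
    # Synthetic stand-in for one layer: a random 5% of coordinates are perturbed.
    rng = np.random.default_rng(0)
    w0 = rng.normal(size=(64, 64))
    mask = rng.random((64, 64)) < 0.05
    w1 = w0 + mask * rng.normal(scale=0.01, size=(64, 64))
    print("sparsity:", update_sparsity(w0, w1))
    print("rank, k@90%:", spectral_summary(w1 - w0))
    print("near-zero share:", near_zero_share(w0, w1))
```

On real checkpoints, `w_before` and `w_after` would be the same parameter tensor taken from the student before and after OPD. The synthetic mask in the demo is random, so it should not reproduce the near-zero bias the paper reports.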

Evaluation and Benchmarking · Alignment and RLHF · on-policy distillation · AdamW · Dense Supervision, Sparse Updates: On the Sparsity and Geometry of On-Policy Distillation