
Dense Supervision, Sparse Updates: On the Sparsity and Geometry of On-Policy Distillation

paper · active · provisional · dense-supervision-sparse-updates-on-the-sparsity-and-geometry-of-on-policy-distillation-a3825d6c · 1 event · first seen 5d ago



Recent events (1)

5 · arXiv · cs.LG · 5d ago

Analysis of on-policy distillation reveals sparse, geometrically structured parameter updates

A new arXiv paper analyzes on-policy distillation (OPD), a post-training method that combines on-policy student trajectories with dense teacher supervision, across language and vision-language model pairs. The authors find that OPD updates are coordinate-sparse and distributed across layers, weighted toward feed-forward (FFN) blocks, and that training only the discovered sparse subnetwork recovers near-full performance. Geometrically, the updates are numerically full-rank but spectrally concentrated, and they fall disproportionately on near-zero weight coordinates. This suggests that OPD retains distinct geometric signatures rather than behaving like ordinary dense parameter rewriting.
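
The quantities named in this summary (coordinate sparsity, numerical rank versus spectral concentration, the share of updates landing on near-zero weights, and sparse-subnetwork training) can be illustrated with a minimal NumPy sketch. Everything here is an assumption made for illustration: the function names, the tolerance `tol`, the 90% energy cutoff, and the quantile `q` are hypothetical choices, not the paper's definitions or thresholds.

```python
import numpy as np


def update_sparsity(w_before, w_after, tol=1e-6):
    """Fraction of coordinates whose absolute change is at most `tol` (assumed threshold)."""
    delta = w_after - w_before
    return float(np.mean(np.abs(delta) <= tol))


def spectral_stats(delta, rank_tol=1e-10, energy=0.9):
    """Return (numerical_rank, k) for a 2-D update matrix.

    numerical_rank counts singular values above rank_tol * largest singular value.
    k is the smallest number of leading singular values whose squared sum
    reaches the `energy` fraction of the total squared spectrum.
    """
    s = np.linalg.svd(delta, compute_uv=False)  # sorted in descending order
    if s[0] == 0.0:
        return 0, 0
    numerical_rank = int(np.sum(s > rank_tol * s[0]))
    cumulative = np.cumsum(s ** 2) / np.sum(s ** 2)
    k = int(np.searchsorted(cumulative, energy) + 1)
    return numerical_rank, k


def near_zero_share(w_before, changed, q=0.1):
    """Share of changed coordinates whose original magnitude is at or below
    the q-quantile of |w_before|. A value well above q would indicate that
    updates fall disproportionately on near-zero weights."""
    if not changed.any():
        return float("nan")
    magnitudes = np.abs(w_before)
    threshold = np.quantile(magnitudes, q)
    return float(np.mean(magnitudes[changed] <= threshold))


def masked_step(w, grad, mask, lr):
    """One plain gradient step restricted to the coordinates selected by `mask`."""
    return w - lr * grad * mask


if __name__ == "__main__":
    # A full-rank update whose spectrum is dominated by one direction.
    delta = np.diag([10.0, 1.0, 0.1])
    print(spectral_stats(delta))

    # Sparsity of a toy update that touches one of four coordinates.
    before = np.zeros((2, 2))
    after = np.array([[0.0, 1.0], [0.0, 0.0]])
    print(update_sparsity(before, after))
```

In this sketch the mask passed to `masked_step` would be something like `np.abs(w_after - w_before) > tol`, taken from a prior full run; how the paper actually identifies and trains its subnetwork is not specified in this summary.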