Purified OPSD: On-Policy Self-Distillation Without Losing How to Think
paper · active · provisional
More like this (12)
- On-Policy Distillation (OPD)
- on-policy self-distillation
- On-Policy Co-Distillation
- On-Policy Self-Distillation with Sampled Demonstrations Reduces Output Diversity
- on-policy distillation
- Multi-Teacher On-Policy Distillation
- Learning from the Self-future: On-policy Self-distillation for dLLMs
- Canonical-Context On-Policy Distillation (CCOPD)
- Dense Supervision, Sparse Updates: On the Sparsity and Geometry of On-Policy Distillation
- Skill-Conditioned Gated Self-Distillation (SGSD)
- Rubric-Conditioned Self-Distillation
- Self-Distillation
Recent events (1)
Purified OPSD fixes on-policy self-distillation failures in long chain-of-thought reasoning models
A new arXiv preprint identifies why on-policy self-distillation (OPSD) consistently degrades long chain-of-thought (CoT) reasoning models: the teacher's supervision signal is dominated by reference-induced shortcuts rather than by question-conditioned corrections that transfer to the student. The authors propose a two-step fix: a reference-only teacher isolates the non-transferable component of that signal, and pointwise mutual information (PMI) is then used to construct a cleaner distillation target. Experiments with four long-CoT models on two datasets show consistent improvements over both the base model and standard OPSD, while preserving reflective reasoning behavior.
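The summary does not give the exact objective, so the following is only a minimal sketch of how a PMI-based target could be built for a single token position. It assumes the full teacher sees the question q and the reference r, while the reference-only teacher sees r but not q. Under that assumption, the conditional PMI log p(y | q, r) - log p(y | r) scores how much the question, rather than the reference alone, supports each candidate token y. The function names, the `beta` strength, and the choice to tilt the student's own distribution by the PMI are hypothetical illustrations, not the authors' method.

```python
import math
from typing import List


def log_softmax(logits: List[float]) -> List[float]:
    """Numerically stable log-softmax over one vocabulary distribution."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def conditional_pmi(logp_full: List[float], logp_ref_only: List[float]) -> List[float]:
    """Per-token PMI(y; q | r) = log p(y | q, r) - log p(y | r).

    logp_full: log-probs from the teacher conditioned on question + reference.
    logp_ref_only: log-probs from the (assumed) reference-only teacher.
    """
    return [a - b for a, b in zip(logp_full, logp_ref_only)]


def pmi_target(
    logp_student: List[float],
    logp_full: List[float],
    logp_ref_only: List[float],
    beta: float = 1.0,
) -> List[float]:
    """HYPOTHETICAL target: tilt the student's distribution by beta * PMI.

    Tokens the question supports beyond what the reference alone explains get
    up-weighted; reference-only shortcut tokens get down-weighted. The result
    is renormalized into a valid log-probability vector.
    """
    pmi = conditional_pmi(logp_full, logp_ref_only)
    return log_softmax([s + beta * p for s, p in zip(logp_student, pmi)])


def kl(logp_p: List[float], logp_q: List[float]) -> float:
    """KL(p || q) for two log-probability vectors over the same vocabulary."""
    return sum(math.exp(a) * (a - b) for a, b in zip(logp_p, logp_q))


if __name__ == "__main__":
    # Toy 3-token vocabulary at one position of a student rollout.
    student = log_softmax([1.0, 0.5, -0.5])
    teacher_full = log_softmax([3.0, 0.0, 0.0])
    teacher_ref_only = log_softmax([2.0, 0.0, 0.0])
    target = pmi_target(student, teacher_full, teacher_ref_only, beta=1.0)
    # Per-position distillation loss the student would minimize (reverse KL here).
    print(kl(student, target))
```

If the full and reference-only teachers agree exactly, the PMI is zero and the target collapses to the student's own distribution, so that position contributes no gradient. Whether the real method uses a forward or reverse KL, a tilt of this form, or something else entirely is not stated in the summary.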