On-Policy Self-Distillation with Sampled Demonstrations Reduces Output Diversity
On-policy self-distillation reduces output diversity compared to RL, paper shows
A new arXiv paper analyzes on-policy self-distillation, in which a single model serves as both teacher and student, conditioned on correct demonstrations. The method achieves strong pass@1 accuracy, but at the cost of reduced rollout diversity and flattened pass@k curves. The authors trace this to compounding biases: because teacher feedback is channeled through the model's own biases, it amplifies probability mass on already-dominant modes rather than preserving diversity across equally correct solutions. Their theoretical analysis shows that the self-distillation policy tilts the base distribution by pointwise conditional mutual information, whereas ideal on-policy RL preserves probability ratios among correct rollouts. Empirical results on graph path-finding and science QA benchmarks confirm that self-distilled models match RL on average performance but fail in out-of-distribution settings that require diverse strategies.
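The two effects described above can be illustrated with a toy sketch. It is not the paper's code or data: the three-rollout distribution, the tilt weights in `log_w`, and the per-problem success probabilities are all hypothetical values chosen for illustration. The first part contrasts renormalizing over correct rollouts (which keeps their probability ratios) with a generic exponential tilt that favors the dominant mode. The second part shows how two policies with the same pass@1 can have very different pass@k curves.

```python
import math


def renormalize_over_correct(base, correct):
    """Zero out incorrect rollouts and renormalize; ratios among correct ones are unchanged."""
    z = sum(p for y, p in base.items() if y in correct)
    return {y: (p / z if y in correct else 0.0) for y, p in base.items()}


def tilt(base, log_weights):
    """Generic exponential tilt: q(y) is proportional to base(y) * exp(log_weights[y])."""
    unnorm = {y: p * math.exp(log_weights[y]) for y, p in base.items()}
    z = sum(unnorm.values())
    return {y: v / z for y, v in unnorm.items()}


def pass_at_k(success_probs, k):
    """Expected pass@k, given each problem's per-sample success probability (independent samples)."""
    return sum(1.0 - (1.0 - p) ** k for p in success_probs) / len(success_probs)


# Hypothetical base policy over three rollouts; A and B are correct, C is not.
base = {"A": 0.4, "B": 0.2, "C": 0.4}
correct = {"A", "B"}

rl_like = renormalize_over_correct(base, correct)

# Hypothetical tilt weights that favor the already dominant correct mode A.
# These are illustrative only and are not derived from the paper's analysis.
log_w = {"A": 2.0, "B": 0.0, "C": -10.0}
tilted = tilt(base, log_w)

base_ratio = base["A"] / base["B"]
rl_ratio = rl_like["A"] / rl_like["B"]
tilted_ratio = tilted["A"] / tilted["B"]

# Two hypothetical policies with equal pass@1 on four problems.
diverse = [0.5, 0.5, 0.5, 0.5]      # every problem is solved half the time
collapsed = [1.0, 1.0, 0.0, 0.0]    # two problems always solved, two never

if __name__ == "__main__":
    print("A/B ratio: base", base_ratio, "renormalized", rl_ratio, "tilted", tilted_ratio)
    for k in (1, 2, 4, 8):
        print(k, pass_at_k(diverse, k), pass_at_k(collapsed, k))
```

In this sketch, renormalization leaves the A-to-B ratio at its base value, while the tilt enlarges it. The collapsed policy's pass@k stays at its pass@1 value for every k, which is one way a flattened pass@k curve can arise.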