On-Policy Self-Distillation with Sampled Demonstrations Reduces Output Diversity

arXiv · cs.AI · 3d ago

On-policy self-distillation reduces output diversity compared to RL, paper shows

A new arXiv paper analyzes on-policy self-distillation, in which a single model acts as both teacher and student, with the teacher conditioned on correct demonstrations. The method achieves strong pass@1 accuracy, but at the cost of reduced rollout diversity and flattened pass@k curves. The authors trace this to compounding bias: because teacher feedback is channeled through the model's own biases, it amplifies probability mass on already-dominant modes rather than preserving diversity across equally correct solutions. The theoretical analysis shows that the self-distillation policy tilts the base distribution by pointwise conditional mutual information, whereas ideal on-policy RL preserves the probability ratios among correct rollouts. Empirical results on graph path-finding and science QA benchmarks confirm that self-distilled models match RL on average performance but fail in out-of-distribution settings that require diverse strategies.
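The contrast between preserving ratios and amplifying dominant modes can be illustrated with a toy calculation. The sketch below is not the paper's method: the four-rollout base policy, the correct set, and the `SHARPEN` exponent are all hypothetical, and raising probabilities to a power is only a stand-in for mode amplification, not the pointwise conditional mutual information tilt the authors derive. It compares an idealized RL update (condition on correctness, keep ratios) with a sharpened policy, both of which put all mass on correct rollouts.

```python
import math


def normalize(weights):
    """Rescale non-negative weights so they sum to 1."""
    total = sum(weights.values())
    return {y: w / total for y, w in weights.items()}


def entropy(dist):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)


def ratio(dist, a, b):
    """Probability ratio between two rollouts."""
    return dist[a] / dist[b]


def coverage(dist, y, k):
    """Chance that k independent samples contain rollout y at least once."""
    return 1 - (1 - dist[y]) ** k


# Hypothetical base policy: rollouts A, B, C are correct, D is wrong.
base = {"A": 0.5, "B": 0.2, "C": 0.1, "D": 0.2}
correct = {"A", "B", "C"}

# Idealized on-policy RL: condition on correctness, so ratios among
# correct rollouts are the same as under the base policy.
rl = normalize({y: p for y, p in base.items() if y in correct})

# Assumed stand-in for mode amplification (NOT the paper's tilt):
# sharpen the correct rollouts with an exponent greater than 1.
SHARPEN = 2.0
sd = normalize({y: p ** SHARPEN for y, p in base.items() if y in correct})

if __name__ == "__main__":
    for name, dist in [("rl", rl), ("sharpened", sd)]:
        print(name,
              "A/B ratio:", round(ratio(dist, "A", "B"), 3),
              "entropy:", round(entropy(dist), 3),
              "coverage of C at k=8:", round(coverage(dist, "C", 8), 3))
```

In this toy setup both policies have the same pass@1, since every sampled rollout is correct, yet the sharpened policy has lower entropy and is less likely to surface the rare correct rollout C within k samples. That is the kind of gap that would only show up when a task needs a specific, less common strategy.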