paper

On-Policy Self-Distillation with Sampled Demonstrations Reduces Output Diversity

paper · active · provisional · on-policy-self-distillation-with-sampled-demonstrations-reduces-output-diversity-54fa6f65 · 1 event · first seen 3d ago

More like this (12)

on-policy self-distillation
On-Policy Co-Distillation
Learning from the Self-future: On-policy Self-distillation for dLLMs
on-policy distillation
On-Policy Distillation (OPD)
Dense Supervision, Sparse Updates: On the Sparsity and Geometry of On-Policy Distillation
Multi-Teacher On-Policy Distillation
The Role of Feedback Alignment in Self-Distillation
Be My Tutor: On-Policy Co-Distillation for Mutual LLM Improvement via Peer Feedback
LESS: Mutual-Stability Sampling for Diffusion Language Models
What Do Safety-Aligned LLMs Learn From Mixed Compliance Demonstrations?
Rubric-Conditioned Self-Distillation

Recent events (1)

6 · arXiv · cs.AI · 3d ago

On-policy self-distillation reduces output diversity compared to RL, paper shows

A new arXiv paper analyzes on-policy self-distillation, in which a single model serves as both teacher and student, conditioned on correct demonstrations. The method achieves strong pass@1 accuracy, but at the cost of reduced rollout diversity and flattened pass@k curves. The authors trace this to compounding biases: because teacher feedback is channeled through the model's own biases, it amplifies probability mass on already-dominant modes rather than preserving diversity across equally correct solutions. Their theoretical analysis shows that the self-distillation policy tilts the base distribution by pointwise conditional mutual information, whereas ideal on-policy RL preserves probability ratios among correct rollouts. Empirical results on graph path-finding and science QA benchmarks confirm that self-distilled models match RL on average performance but fail in out-of-distribution settings that require diverse strategies.
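The contrast between the two updates can be illustrated with a toy calculation. The sketch below is not the paper's derivation: the four rollouts, their base probabilities, and the tilt scores standing in for pointwise conditional mutual information are all made-up values, and the exponential form of the tilt is an assumption. It only shows that renormalizing over correct rollouts keeps their probability ratio fixed, while a tilt that favors the dominant mode changes it.

```python
import math

# Hypothetical base policy over four rollouts for a single prompt.
# Rollouts "a" and "b" are both correct; "c" and "d" are wrong.
base = {"a": 0.5, "b": 0.1, "c": 0.3, "d": 0.1}
correct = {"a", "b"}


def normalize(weights):
    """Rescale non-negative weights so they sum to 1."""
    total = sum(weights.values())
    return {y: w / total for y, w in weights.items()}


def ideal_rl(policy, correct_set):
    """Idealized RL target: drop wrong rollouts and renormalize.

    Every correct rollout is scaled by the same constant, so ratios
    among correct rollouts are unchanged.
    """
    return normalize(
        {y: (p if y in correct_set else 0.0) for y, p in policy.items()}
    )


def tilt(policy, scores):
    """Exponential tilt (assumed form): new(y) proportional to policy(y) * exp(score(y))."""
    return normalize({y: p * math.exp(scores[y]) for y, p in policy.items()})


# Made-up tilt scores; the already-dominant correct rollout "a" gets the
# largest score, mimicking feedback routed through the model's own biases.
pmi = {"a": 1.5, "b": 0.5, "c": -2.0, "d": -2.0}

rl = ideal_rl(base, correct)
sd = tilt(base, pmi)

base_ratio = base["a"] / base["b"]
rl_ratio = rl["a"] / rl["b"]
sd_ratio = sd["a"] / sd["b"]

print(f"base ratio a/b:   {base_ratio:.2f}")
print(f"ideal RL ratio:   {rl_ratio:.2f}")
print(f"tilted ratio:     {sd_ratio:.2f}")
```

In this toy setup the tilted ratio equals the base ratio multiplied by exp(1.5 - 0.5), so the minority correct rollout "b" loses relative mass even though it is equally correct. With different hypothetical scores the tilt could just as well leave the ratio untouched; the point is only that, unlike renormalization, it is not guaranteed to.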

Evaluation and Benchmarking · Alignment and RLHF · On-Policy Self-Distillation with Sampled Demonstrations Reduces Output Diversity