6 · arXiv cs.AI (Artificial Intelligence) · 12h ago

Purified OPSD fixes on-policy self-distillation failures in long chain-of-thought reasoning models

A new arXiv preprint identifies why on-policy self-distillation (OPSD) consistently degrades long chain-of-thought (CoT) reasoning models: the teacher's supervision signal is dominated by reference-induced shortcuts rather than question-conditioned, transferable corrections. The authors propose a two-step fix: a reference-only teacher isolates the non-transferable component, and pointwise mutual information (PMI) is then used to construct a cleaner distillation target. Experiments across four long-CoT models on two datasets show consistent improvements over both the base model and standard OPSD while preserving reflective reasoning behavior.

Alignment and RLHF · on-policy self-distillation · Pointwise Mutual Information · Purified OPSD: On-Policy Self-Distillation Without Losing How to Think
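The summary above does not give the preprint's exact target, so the following is only a minimal sketch of one plausible reading. It assumes three next-token log-probability vectors at a single position of the student's own rollout: the full teacher (question plus reference), a reference-only teacher, and the student. The conditional PMI, log p(y | q, r) - log p(y | r), removes what the reference alone would predict. The function names and the choice to shift the student's distribution by the PMI and renormalize are assumptions made for illustration, not the authors' method.

```python
import math


def log_softmax(scores):
    """Normalize unnormalized log-scores into log-probabilities."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]


def conditional_pmi(logp_full, logp_ref_only):
    """Per-token PMI(token; question | reference) = log p(y|q,r) - log p(y|r)."""
    return [a - b for a, b in zip(logp_full, logp_ref_only)]


def purified_target(logp_student, logp_full, logp_ref_only):
    """ASSUMPTION (illustrative): shift the student's log-probs by the PMI, then renormalize."""
    pmi = conditional_pmi(logp_full, logp_ref_only)
    return log_softmax([s + p for s, p in zip(logp_student, pmi)])


# Toy 3-token vocabulary: the full teacher prefers token 0, the reference-only teacher is flat.
student = log_softmax([0.0, 0.0, 0.0])
full_teacher = log_softmax([2.0, 0.0, 0.0])
ref_only_teacher = log_softmax([0.0, 0.0, 0.0])
target = purified_target(student, full_teacher, ref_only_teacher)
```

Under this reading, a token that the reference-only teacher already predicts as strongly as the full teacher gets a PMI of zero, so the target leaves the student's probability for it unchanged.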

Related guides (1)

Alignment and RLHF · Topic guide

Alignment and RLHF: Teaching AI to Do What We Actually Want


Related events (8)

5 · arXiv · cs.LG · 12h ago

DemoPSD: Disagreement-modulated policy self-distillation to fix privileged information leakage in LLM reasoning training

DemoPSD is a new training framework for LLMs that addresses two failure modes in on-policy self-distillation (OPSD): overfitting to in-domain patterns and privileged information leakage, where the student model learns answer-dependent shortcuts unavailable at test time. The method steers the student toward a reverse-KL barycenter target — a weighted geometric blend of teacher and student distributions — with token-level blending weights derived from the disagreement between the two distributions. Experiments on SciKnowEval across four scientific domains show DemoPSD outperforms GRPO and SDPO while maintaining higher training entropy and generalizing to out-of-distribution GPQA benchmarks.

Evaluation and Benchmarking · Alignment and RLHF · SciKnowEval · GRPO · SDPO
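The "weighted geometric blend" in the DemoPSD summary can be made concrete. For a weight w in [0, 1], the distribution q minimizing w·KL(q‖p_T) + (1-w)·KL(q‖p_S) is the normalized product p_T^w · p_S^(1-w). The sketch below computes that blend for one token position. How DemoPSD maps disagreement to a weight is not stated in the summary, so `disagreement_weight` (one minus total variation distance) is a hypothetical stand-in, including its direction.

```python
import math


def geometric_blend(p_teacher, p_student, w):
    """Normalized p_T^w * p_S^(1-w); inputs must be strictly positive probabilities."""
    logs = [w * math.log(t) + (1.0 - w) * math.log(s)
            for t, s in zip(p_teacher, p_student)]
    m = max(logs)
    z = sum(math.exp(x - m) for x in logs)
    return [math.exp(x - m) / z for x in logs]


def total_variation(p, q):
    """Total variation distance between two distributions, in [0, 1]."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def disagreement_weight(p_teacher, p_student):
    """HYPOTHETICAL mapping: trust the teacher less where the two disagree more."""
    return 1.0 - total_variation(p_teacher, p_student)


p_teacher = [0.7, 0.2, 0.1]
p_student = [0.2, 0.5, 0.3]
w = disagreement_weight(p_teacher, p_student)
blend = geometric_blend(p_teacher, p_student, w)
```

Setting w to 1 recovers the teacher and w to 0 recovers the student, so a per-token weight interpolates between ordinary distillation and leaving the student unchanged.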

5 · arXiv · cs.CL · 16d ago

d-OPSD: First on-policy self-distillation framework tailored for diffusion LLMs

Researchers introduce d-OPSD, the first on-policy self-distillation (OPSD) framework designed specifically for diffusion large language models (dLLMs). The method addresses a fundamental mismatch between existing autoregressive OPSD approaches and dLLMs' arbitrary-order generation by using suffix conditioning on self-generated answers and step-level rather than token-level divergence supervision. Across four reasoning benchmarks, d-OPSD outperforms RLVR and SFT baselines while requiring only ~10% of the optimization steps of RLVR, suggesting strong sample efficiency gains for dLLM post-training.

Frontier Model Releases · Alignment and RLHF · d-OPSD · Learning from the Self-future: On-policy Self-distillation for dLLMs

6 · arXiv · cs.CL · 1mo ago

Are Full Rollouts Necessary for On-Policy Distillation?

This paper investigates whether full rollouts are required during on-policy distillation (OPD) for training reasoning models, identifying rollout horizon as a key computational bottleneck. The authors propose two strategies: Progressive OPD (POPD), which gradually expands rollout horizon during training, and Truncated OPD (TOPD), which uses permanently truncated rollouts. Experiments on mathematical reasoning show POPD achieves up to 3× training efficiency improvement, while TOPD matches full OPD performance using only 10% of the rollout horizon, yielding significant wall-clock and memory savings.

Training Infrastructure · Frontier Model Releases · On-Policy Distillation (OPD) · mathematical reasoning · Truncated OPD (TOPD)
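The two rollout-horizon strategies in the summary above can be illustrated with a small schedule sketch. The summary says only that POPD expands the horizon gradually and that TOPD keeps it permanently truncated. The linear growth curve, the function names, and the parameters `start_fraction` and `fraction` are assumptions for illustration; the 10% default mirrors the figure quoted for TOPD.

```python
def progressive_horizon(step, total_steps, full_horizon, start_fraction=0.1):
    """POPD-style schedule. ASSUMPTION: linear growth from start_fraction to the full horizon."""
    progress = min(step, total_steps) / total_steps
    frac = start_fraction + (1.0 - start_fraction) * progress
    return max(1, int(round(frac * full_horizon)))


def truncated_horizon(full_horizon, fraction=0.1):
    """TOPD-style schedule: a fixed fraction of the full horizon for the whole run."""
    return max(1, int(round(fraction * full_horizon)))


# Maximum rollout length (in tokens) at a few points of a 100-step run.
horizons = [progressive_horizon(s, 100, 8000) for s in (0, 50, 100)]
fixed = truncated_horizon(8000)
```

In a training loop, the returned value would cap the number of tokens sampled per rollout before the distillation loss is computed.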

5 · arXiv · cs.AI · 15d ago

Rubric-Conditioned Self-Distillation: structured feedback for reasoning model post-training

A new arXiv preprint proposes Rubric-Conditioned Self-Distillation (RCSD), a post-training framework that replaces scalar reward signals and noisy chain-of-thought annotations with structured rubrics for fine-grained credit assignment. The method conditions a teacher model on criterion-level rubrics to provide token-level guidance on the student's own sampled trajectories, avoiding reliance on a single reference rationale. Evaluated on science reasoning benchmarks, RCSD outperforms GRPO by 1.0 points and OPSD by 0.9 points on average.

Evaluation and Benchmarking · Alignment and RLHF · OPSD · GRPO · Rubric-Conditioned Self-Distillation

5 · arXiv · cs.CL · 18d ago

OPCoD: On-Policy Co-Distillation for Mutual LLM Improvement via Peer Feedback

Researchers introduce On-Policy Co-Distillation (OPCoD), a training framework where two LLMs, each stronger in a different domain, iteratively tutor each other using on-policy rollouts and peer feedback. The method uses cognizance-based gating to control when feedback is given and feedback anchoring to ground it in the problem context. On Science Q&A tasks, OPCoD achieves Pareto improvement for both models across all evaluated domain pairs, outperforming one-way distillation and single-model fine-tuning baselines.

Evaluation and Benchmarking · Alignment and RLHF · On-Policy Co-Distillation · Be My Tutor: On-Policy Co-Distillation for Mutual LLM Improvement via Peer Feedback

6 · arXiv · cs.AI · 1mo ago

Skill-Conditioned Gated Self-Distillation (SGSD) for LLM Reasoning

SGSD is a new on-policy self-distillation method for LLM reasoning that replaces trusted privileged information (e.g., reference answers) with an experience-derived skill bank of skill-mistake pairs. It constructs a multi-teacher pool, validates each teacher's contribution via a verifier, and applies a gated objective to distill informative disagreements while suppressing noisy signals. On Qwen3-1.7B, SGSD outperforms GRPO by 6.2% and answer-conditioned OPSD by 1.7% on average across AIME24, AIME25, and HMMT25. The method relaxes the assumption of trusted privileged information, making self-distillation more practical under weaker supervision.

Frontier Model Releases · Evaluation and Benchmarking · OPSD · AIME24 · SGSD

5 · arXiv · cs.AI · 23d ago

Step-aligned critique outperforms GRPO and reference-solution conditioning in self-distillation

A new arXiv paper investigates context design for self-distillation of language models, comparing binary reward (GRPO), reference solutions, and step-by-step critiques aligned to the solver's reasoning trace. Step-aligned critique yields the largest gains, outperforming GRPO by 16.11 points and reference-solution conditioning by 5.27 points on Avg@12. Per-token advantage analysis shows that step-aligned feedback targets only failing tokens, avoiding unnecessary pressure on already-correct reasoning steps. The findings suggest structural alignment between feedback and the model's reasoning trace is a key driver of self-distillation effectiveness.

Evaluation and Benchmarking · Alignment and RLHF · GRPO · The Role of Feedback Alignment in Self-Distillation

6 · arXiv · cs.CL · 1mo ago

Self-Policy Distillation via Capability-Selective Subspace Projection

This paper introduces Self-Policy Distillation (SPD), a self-distillation method for LLMs that requires no external signals such as correctness filters or reward models. SPD extracts a low-rank capability subspace from the model's own gradients on correctness-defining tokens, then projects KV activations into this subspace during self-generation to isolate task-relevant signal from stylistic noise. Experiments across code generation, math reasoning, and QA show up to 13% improvement over prior signal-free self-distillation methods and 15% better out-of-domain generalization.

Frontier Model Releases · Evaluation and Benchmarking · large language models · key-value (KV) activation projection · low-rank subspace projection