Entity · technique

on-policy self-distillation

technique · active · on-policy-self-distillation-6b9c43a2 · 5 events · first seen May 19, 2026

Aliases: on-policy self-distillation, Self-Policy Distillation (SPD), Neuron On-Policy Self-Distillation

Co-occurring entities

β-OPSD · return-to-go credit assignment · Neuron-Aware Data Selection for Annotation-Free LLM Self-Distillation · GRPO · Pointwise Mutual Information · Purified OPSD: On-Policy Self-Distillation Without Losing How to Think · large language models · key-value (KV) activation projection · low-rank subspace projection · Multimodal Large Language Models · Thinking-with-Images · Vision-OPD · regional-to-global perception gap

More like this (12)

on-policy distillation · On-Policy Co-Distillation · On-Policy Distillation (OPD) · Learning from the Self-future: On-policy Self-distillation for dLLMs · Purified OPSD: On-Policy Self-Distillation Without Losing How to Think · On-Policy Self-Distillation with Sampled Demonstrations Reduces Output Diversity · Self-Distillation · Routing-based On-Policy Distillation · Weak-to-Strong Generalization via Direct On-Policy Distillation · Pass the Baton: Trajectory-Relayed On-Policy Distillation · Dense Supervision, Sparse Updates: On the Sparsity and Geometry of On-Policy Distillation · Multi-Teacher On-Policy Distillation

Recent events (5)

5 · arXiv · cs.LG · 41h ago

β-OPSD: Generalizing on-policy self-distillation via policy optimization equivalence

This paper shows that vanilla on-policy self-distillation (OPSD) is the β=1 special case of a broader policy-optimization family parameterized by a KL penalty weight β. The authors derive β-OPSD, whose optimal policy is a geometric interpolation between a reference policy and a privileged teacher, and implement it efficiently by mixing token-level logits rather than running full RL. Experiments on mathematical reasoning benchmarks show that β-OPSD improves optimization stability and downstream performance over vanilla OPSD.

Evaluation and Benchmarking · Alignment and RLHF · β-OPSD · return-to-go credit assignment · on-policy self-distillation
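
The logit-mixing idea in the β-OPSD item can be illustrated with a small sketch. The summary states only that the optimum is a geometric interpolation between reference and teacher and that β=1 recovers vanilla OPSD. The specific weight 1/β used here is an assumption, taken from the standard closed form of KL-regularized policy optimization with reward log(teacher/reference). The function names are hypothetical and not from the paper.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def mix_logits(ref_logits, teacher_logits, beta):
    """Geometric interpolation of two next-token distributions.

    Assumed form (not confirmed by the source):
        pi proportional to pi_ref^(1 - 1/beta) * pi_teacher^(1/beta)
    so beta = 1 returns the teacher and large beta stays near the reference.
    Returns normalized log-probabilities.
    """
    if beta <= 0:
        raise ValueError("beta must be positive")
    w = 1.0 / beta
    ref_lp = log_softmax(ref_logits)
    tea_lp = log_softmax(teacher_logits)
    # Mixing in log space is a weighted geometric mean in probability space.
    mixed = [(1.0 - w) * r + w * t for r, t in zip(ref_lp, tea_lp)]
    return log_softmax(mixed)

# Hypothetical three-token vocabulary.
ref = [2.0, 0.0, -1.0]
teacher = [0.0, 3.0, 1.0]
target_log_probs = mix_logits(ref, teacher, beta=2.0)
```

Under this assumed form, values of β below 1 extrapolate past the teacher rather than interpolate, so a real implementation would need to decide whether to allow them.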

5 · arXiv · cs.LG · Jul 3, 2026

Neuron-OPSD: Annotation-free LLM self-distillation guided by internal neuron activations

Researchers introduce Neuron On-Policy Self-Distillation (Neuron-OPSD), a data-centric framework for post-training LLMs without human annotations or real-world interaction feedback. The method uses internal neuron activations to guide training-data selection and teacher context construction, then trains via on-policy distillation from the teacher distribution. Evaluated on specialized-domain benchmarks, Neuron-OPSD improves in-domain performance while preserving cross-domain generalization and avoiding the calibration collapse seen in prior SFT-, GRPO-, and reward-RL-based annotation-free approaches.

Evaluation and Benchmarking · Alignment and RLHF · Neuron-Aware Data Selection for Annotation-Free LLM Self-Distillation · GRPO · on-policy self-distillation

6 · arXiv · cs.AI · Jul 3, 2026

Purified OPSD fixes on-policy self-distillation failures in long chain-of-thought reasoning models

A new arXiv preprint identifies why on-policy self-distillation (OPSD) consistently degrades long chain-of-thought reasoning models: the teacher's supervision signal is dominated by reference-induced shortcuts rather than question-conditioned, transferable corrections. The authors propose a two-step fix using a reference-only teacher to isolate the non-transferable component and pointwise mutual information (PMI) to construct a cleaner distillation target. Experiments across four long-CoT models on two datasets show consistent improvements over both the base model and standard OPSD while preserving reflective reasoning behavior.

Alignment and RLHF · on-policy self-distillation · Pointwise Mutual Information · Purified OPSD: On-Policy Self-Distillation Without Losing How to Think
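
One possible reading of the PMI step in the Purified OPSD item is sketched below. It is not necessarily the paper's exact construction. The assumption is that the full teacher sees question plus reference, the reference-only teacher sees the reference alone, and the per-token PMI is the difference of their log-probabilities. Turning the scores into a target with a softmax is a further illustrative assumption, and all function names are hypothetical.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def pmi_scores(full_teacher_logits, ref_only_logits):
    """Per-token PMI between a token and the question, given the reference.

    Assumed form: log p(token | question, reference) - log p(token | reference).
    Tokens that are likely from the reference alone score near or below zero.
    """
    full_lp = log_softmax(full_teacher_logits)
    ref_lp = log_softmax(ref_only_logits)
    return [f - r for f, r in zip(full_lp, ref_lp)]

def purified_target(full_teacher_logits, ref_only_logits):
    """Normalize the PMI scores into a distribution (illustrative choice)."""
    scores = pmi_scores(full_teacher_logits, ref_only_logits)
    return [math.exp(lp) for lp in log_softmax(scores)]

# Hypothetical logits: token 0 is favored by the reference alone,
# token 1 becomes likely only once the question is also visible.
full = [2.0, 2.0, 0.0]
ref_only = [2.0, 0.0, 0.0]
scores = pmi_scores(full, ref_only)
target = purified_target(full, ref_only)
```

In this toy case the token that depends on the question receives the highest score, while the token the reference alone already predicts is down-weighted.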

6 · arXiv · cs.CL · May 22, 2026

Self-Policy Distillation via Capability-Selective Subspace Projection

This paper introduces Self-Policy Distillation (SPD), a self-distillation method for LLMs that requires no external signals such as correctness filters or reward models. SPD extracts a low-rank capability subspace from the model's own gradients on correctness-defining tokens, then projects KV activations into this subspace during self-generation to isolate task-relevant signal from stylistic noise. Experiments across code generation, math reasoning, and QA show up to 13% improvement over prior signal-free self-distillation methods and 15% better out-of-domain generalization.

Frontier Model Releases · Evaluation and Benchmarking · large language models · key-value (KV) activation projection · low-rank subspace projection · +2 more
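
The subspace-projection step in the SPD item can be pictured with a toy example. The source does not say how the low-rank subspace is extracted from gradients. This sketch substitutes plain Gram-Schmidt orthonormalization of a few gradient vectors for whatever decomposition the paper uses, and then projects one activation vector onto the resulting span. All names and data are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def orthonormal_basis(vectors, tol=1e-10):
    """Gram-Schmidt: orthonormal basis for the span of `vectors`.

    Stand-in for the paper's subspace extraction, which is not specified
    in the summary. Near-dependent vectors are dropped.
    """
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            coeff = dot(w, b)
            w = [x - coeff * y for x, y in zip(w, b)]
        norm = math.sqrt(dot(w, w))
        if norm > tol:
            basis.append([x / norm for x in w])
    return basis

def project(activation, basis):
    """Orthogonal projection of `activation` onto the span of `basis`."""
    out = [0.0] * len(activation)
    for b in basis:
        coeff = dot(activation, b)
        out = [o + coeff * y for o, y in zip(out, b)]
    return out

# Hypothetical gradients spanning the first two coordinates of a 3-d space.
grads = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
basis = orthonormal_basis(grads)
kv_activation = [2.0, 3.0, 5.0]
projected = project(kv_activation, basis)
```

The component outside the subspace (the third coordinate here) is discarded, which is the sense in which projection would separate task-relevant signal from everything else.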

6 · arXiv · cs.CL · May 19, 2026

Vision-OPD: On-Policy Self-Distillation for Fine-Grained Visual Understanding in MLLMs

Vision-OPD addresses a 'regional-to-global perception gap' in multimodal LLMs, where models answer fine-grained visual questions more accurately when given cropped evidence regions than full images. The method instantiates a crop-conditioned teacher and full-image-conditioned student from the same MLLM, minimizing token-level divergence along on-policy rollouts to transfer regional perception to the full-image policy. This self-distillation requires no external teacher models, ground-truth labels, reward verifiers, or inference-time tools. Benchmarks show competitive or superior performance against larger open-source, closed-source, and agentic 'Thinking-with-Images' models.

Evaluation and Benchmarking · Agent and Tool Ecosystem · Multimodal Large Language Models · Thinking-with-Images · on-policy self-distillation · +4 more
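
The phrase "token-level divergence along on-policy rollouts" in the Vision-OPD item can be sketched as follows. The summary does not name the divergence. Reverse KL (student relative to teacher) is assumed here purely for illustration, and the per-step logits stand in for what the full-image student and crop-conditioned teacher would produce at each position of a student-sampled rollout. Function names are hypothetical.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def token_kl(student_logits, teacher_logits):
    """KL(student || teacher) for one position (assumed divergence)."""
    s_lp = log_softmax(student_logits)
    t_lp = log_softmax(teacher_logits)
    return sum(math.exp(s) * (s - t) for s, t in zip(s_lp, t_lp))

def rollout_distill_loss(student_steps, teacher_steps):
    """Mean per-token divergence over one student-generated rollout.

    Both arguments are lists of per-position logit vectors, evaluated on
    the same token prefix sampled from the student (the on-policy part).
    """
    if not student_steps or len(student_steps) != len(teacher_steps):
        raise ValueError("need equal-length, non-empty step lists")
    per_token = [token_kl(s, t) for s, t in zip(student_steps, teacher_steps)]
    return sum(per_token) / len(per_token)

# Hypothetical two-step rollout over a three-token vocabulary.
student = [[1.0, 0.5, 0.0], [0.0, 0.0, 2.0]]
teacher = [[2.0, 0.0, 0.0], [0.0, 0.0, 2.0]]
loss = rollout_distill_loss(student, teacher)
```

Because teacher and student share weights in self-distillation, a training loop would typically treat the teacher's logits as constants when differentiating this loss; that detail is outside this sketch.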