Neuron-OPSD: Annotation-free LLM self-distillation guided by internal neuron activations
Researchers introduce Neuron On-Policy Self-Distillation (Neuron-OPSD), a data-centric framework for post-training LLMs without human annotations or real-world interaction feedback. The method uses internal neuron activations to guide training-data selection and teacher context construction, then trains via on-policy distillation from the teacher distribution. Evaluated on specialized-domain benchmarks, Neuron-OPSD improves in-domain performance while preserving cross-domain generalization and avoiding the calibration collapse seen in prior SFT-, GRPO-, and reward-RL-based annotation-free approaches.
Related events (8)
d-OPSD: First on-policy self-distillation framework tailored for diffusion LLMs
Researchers introduce d-OPSD, the first on-policy self-distillation (OPSD) framework designed specifically for diffusion large language models (dLLMs). The method addresses a fundamental mismatch between existing autoregressive OPSD approaches and dLLMs' arbitrary-order generation by using suffix conditioning on self-generated answers and step-level rather than token-level divergence supervision. Across four reasoning benchmarks, d-OPSD outperforms RLVR and SFT baselines while requiring only ~10% of the optimization steps of RLVR, suggesting strong sample efficiency gains for dLLM post-training.
DemoPSD: Disagreement-modulated policy self-distillation to fix privileged information leakage in LLM reasoning training
DemoPSD is a new training framework for LLMs that addresses two failure modes in on-policy self-distillation (OPSD): overfitting to in-domain patterns and privileged information leakage, where the student model learns answer-dependent shortcuts unavailable at test time. The method steers the student toward a reverse-KL barycenter target (a weighted geometric blend of the teacher and student distributions), with token-level blending weights derived from the disagreement between the two distributions. Experiments on SciKnowEval across four scientific domains show DemoPSD outperforms GRPO and SDPO while maintaining higher training entropy and generalizing to out-of-distribution GPQA benchmarks.
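To make the "weighted geometric blend" concrete: the distribution q that minimizes w·KL(q‖p_teacher) + (1−w)·KL(q‖p_student) is the normalized geometric mean q ∝ p_teacher^w · p_student^(1−w). The sketch below computes that blend for a single token position in plain Python. The function names are illustrative, and `disagreement_weight` in particular is a hypothetical stand-in: the summary does not say how DemoPSD maps disagreement to a weight, so the total-variation rule here is an assumption, not the paper's formula.

```python
import math


def geometric_blend(p_teacher, p_student, w):
    """Normalized weighted geometric mean of two distributions.

    w is the teacher weight in [0, 1]: w=1 recovers the teacher,
    w=0 recovers the student.
    """
    eps = 1e-12  # guards against log(0) for zero-probability tokens
    logits = [
        w * math.log(pt + eps) + (1.0 - w) * math.log(ps + eps)
        for pt, ps in zip(p_teacher, p_student)
    ]
    m = max(logits)  # subtract the max for numerical stability
    unnorm = [math.exp(l - m) for l in logits]
    z = sum(unnorm)
    return [u / z for u in unnorm]


def total_variation(p, q):
    """Total variation distance, a value in [0, 1]."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def disagreement_weight(p_teacher, p_student):
    # ASSUMPTION (not from the source): trust the teacher less at
    # positions where it disagrees more with the student.
    return 1.0 - total_variation(p_teacher, p_student)


# Toy next-token distributions over a 3-token vocabulary.
teacher = [0.7, 0.2, 0.1]
student = [0.3, 0.4, 0.3]

w = disagreement_weight(teacher, student)
target = geometric_blend(teacher, student, w)
```

In a training loop the student would then be pulled toward `target` at each position instead of toward the raw teacher distribution; whether the weight should rise or fall with disagreement is a design choice this sketch does not settle.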
Vision-OPD: On-Policy Self-Distillation for Fine-Grained Visual Understanding in MLLMs
Vision-OPD addresses a 'regional-to-global perception gap' in multimodal LLMs, where models answer fine-grained visual questions more accurately when given cropped evidence regions than full images. The method instantiates a crop-conditioned teacher and full-image-conditioned student from the same MLLM, minimizing token-level divergence along on-policy rollouts to transfer regional perception to the full-image policy. This self-distillation requires no external teacher models, ground-truth labels, reward verifiers, or inference-time tools. Benchmarks show competitive or superior performance against larger open-source, closed-source, and agentic 'Thinking-with-Images' models.
MOPD: Multi-Teacher On-Policy Distillation for integrating multiple RL-trained capabilities in LLMs
Researchers propose Multi-teacher On-Policy Distillation (MOPD), a post-training paradigm that first trains domain-specialized RL teacher models, then distills them into a student model using on-policy rollouts to eliminate exposure bias. Evaluated on Qwen3-30B-A3B, MOPD outperforms Mix-RL, Cascade RL, Off-Policy Finetune, and Param-Merge baselines while preserving nearly all per-domain capability. The method has been deployed in production for MiMo-V2-Flash, an industrial-scale frontier model, validating its practical applicability. The approach also enables parallel, decoupled development of domain teachers, reducing cross-domain interference in multi-capability post-training.
Purified OPSD fixes on-policy self-distillation failures in long chain-of-thought reasoning models
A new arXiv preprint identifies why on-policy self-distillation (OPSD) consistently degrades long chain-of-thought reasoning models: the teacher's supervision signal is dominated by reference-induced shortcuts rather than question-conditioned, transferable corrections. The authors propose a two-step fix using a reference-only teacher to isolate the non-transferable component and pointwise mutual information (PMI) to construct a cleaner distillation target. Experiments across four long-CoT models on two datasets show consistent improvements over both the base model and standard OPSD while preserving reflective reasoning behavior.
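One way to picture the PMI step is as a per-token log-ratio between a teacher that sees both the question and the reference and a teacher that sees only the reference. The sketch below is a minimal illustration under that assumption; the exact conditioning, normalization, and function names (`pmi_target`, `softmax`) are hypothetical and not taken from the preprint.

```python
import math


def softmax(scores):
    """Turn arbitrary real-valued scores into a probability distribution."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def pmi_target(p_full_teacher, p_ref_only, eps=1e-12):
    """Down-weight tokens the reference alone already predicts.

    p_full_teacher: next-token probabilities given question + reference.
    p_ref_only:     next-token probabilities given the reference alone.
    ASSUMPTION: the target is the softmax of the pointwise log-ratio;
    the preprint may construct its target differently.
    """
    pmi = [
        math.log(a + eps) - math.log(b + eps)
        for a, b in zip(p_full_teacher, p_ref_only)
    ]
    return softmax(pmi)


# Token 0 is likely under both teachers (a reference-induced shortcut);
# token 1 becomes likely only once the question is visible.
full_teacher = [0.6, 0.3, 0.1]
ref_only = [0.6, 0.1, 0.3]
target = pmi_target(full_teacher, ref_only)
```

In this toy case the target shifts mass from token 0 to token 1, which is the intended effect: supervision that the reference alone explains contributes nothing, and only question-conditioned signal remains.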
Self-Policy Distillation via Capability-Selective Subspace Projection
This paper introduces Self-Policy Distillation (SPD), a self-distillation method for LLMs that requires no external signals such as correctness filters or reward models. SPD extracts a low-rank capability subspace from the model's own gradients on correctness-defining tokens, then projects KV activations into this subspace during self-generation to isolate task-relevant signal from stylistic noise. Experiments across code generation, math reasoning, and QA show up to 13% improvement over prior signal-free self-distillation methods and 15% better out-of-domain generalization.
Analysis of on-policy distillation reveals sparse, geometrically structured parameter updates
A new arXiv paper analyzes on-policy distillation (OPD), a post-training method combining on-policy student trajectories with dense teacher supervision, across language and vision-language model pairs. The authors find that OPD updates are coordinate-sparse and distributed across layers (FFN-heavy), and that training only the discovered sparse subnetwork recovers near-full performance. Geometrically, updates are numerically full-rank but spectrally concentrated, falling disproportionately on near-zero weight coordinates, suggesting OPD retains distinct geometric signatures rather than behaving like ordinary dense parameter rewriting.
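Coordinate sparsity of this kind can be checked by diffing two checkpoints. The following sketch, using flat Python lists in place of real weight tensors, measures the fraction of unchanged coordinates and builds a mask over the largest changes. The tolerance, the top-k selection rule, and all function names are assumptions for illustration, not the paper's procedure.

```python
def update_sparsity(before, after, tol=1e-6):
    """Fraction of coordinates whose value changed by at most tol."""
    unchanged = sum(1 for b, a in zip(before, after) if abs(a - b) <= tol)
    return unchanged / len(before)


def topk_mask(before, after, k):
    """Boolean mask selecting the k coordinates with the largest |change|."""
    deltas = [abs(a - b) for b, a in zip(before, after)]
    order = sorted(range(len(deltas)), key=lambda i: deltas[i], reverse=True)
    keep = set(order[:k])
    return [i in keep for i in range(len(deltas))]


def apply_masked_update(before, after, mask):
    """Keep the update only where mask is True; elsewhere restore 'before'."""
    return [a if m else b for b, a, m in zip(before, after, mask)]


# Toy flattened weights before and after a distillation run.
before = [0.0, 0.5, -0.2, 0.0, 1.0, 0.001]
after = [0.3, 0.5, -0.2, 0.0, 1.0, -0.2]

sparsity = update_sparsity(before, after)   # 4 of 6 coordinates unchanged
mask = topk_mask(before, after, k=2)        # the two coordinates that moved
restricted = apply_masked_update(before, after, mask)
```

Here `restricted` equals `after`, the toy analogue of a sparse subnetwork carrying the whole update; on a real model the comparison would be made on downstream performance rather than on exact weight equality.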
Skill-Conditioned Gated Self-Distillation (SGSD) for LLM Reasoning
SGSD is a new on-policy self-distillation method for LLM reasoning that replaces trusted privileged information (e.g., reference answers) with an experience-derived skill bank of skill-mistake pairs. It constructs a multi-teacher pool, validates each teacher's contribution via a verifier, and applies a gated objective to distill informative disagreements while suppressing noisy signals. On Qwen3-1.7B, SGSD outperforms GRPO by 6.2% and answer-conditioned OPSD by 1.7% on average across AIME24, AIME25, and HMMT25. The method relaxes the assumption of trusted privileged information, making self-distillation more practical under weaker supervision.


