MOPD: Multi-Teacher On-Policy Distillation for integrating multiple RL-trained capabilities in LLMs
Researchers propose Multi-teacher On-Policy Distillation (MOPD), a post-training paradigm that first trains domain-specialized RL teacher models, then distills them into a student model using on-policy rollouts to eliminate exposure bias. Evaluated on Qwen3-30B-A3B, MOPD outperforms Mix-RL, Cascade RL, Off-Policy Finetune, and Param-Merge baselines while preserving nearly all per-domain capability. The method has been deployed in production for MiMo-V2-Flash, an industrial-scale frontier model, validating its practical applicability. The approach also enables parallel, decoupled development of domain teachers, reducing cross-domain interference in multi-capability post-training.
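As a rough illustration of the distillation step described above, the sketch below scores student-sampled rollouts against the teacher for each rollout's domain and averages a per-token divergence. The summary does not give the paper's loss, so the reverse KL used here, the routing of each rollout to a single domain teacher, and all names (`reverse_kl`, `mopd_loss`, the `teachers` mapping, the rollout dictionary layout) are assumptions made for illustration only.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reverse_kl(student_logits, teacher_logits):
    """KL(student || teacher) for one next-token distribution."""
    p = softmax(student_logits)
    q = softmax(teacher_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def mopd_loss(rollouts, teachers):
    """Mean per-token reverse KL over student-sampled rollouts.

    rollouts: list of dicts with keys 'domain', 'tokens', 'student_logits'
              (hypothetical layout; one logit list per generated token).
    teachers: hypothetical mapping from domain name to a callable that
              returns teacher logits for the student's own tokens.
    """
    total, count = 0.0, 0
    for r in rollouts:
        # The teacher scores the tokens the student itself sampled, so the
        # training distribution matches what the student produces at inference.
        teacher_logits = teachers[r["domain"]](r["tokens"])
        for s, t in zip(r["student_logits"], teacher_logits):
            total += reverse_kl(s, t)
            count += 1
    return total / count


# Toy usage: two domains, each with a stand-in "teacher" returning fixed logits.
teachers = {
    "math": lambda tokens: [[2.0, 0.0, 0.0] for _ in tokens],
    "code": lambda tokens: [[0.0, 2.0, 0.0] for _ in tokens],
}
rollouts = [
    {"domain": "math", "tokens": [0, 0], "student_logits": [[1.0, 0.5, 0.0]] * 2},
    {"domain": "code", "tokens": [1], "student_logits": [[0.0, 2.0, 0.0]]},
]
loss = mopd_loss(rollouts, teachers)
```

Because each teacher only scores rollouts from its own domain in this sketch, teachers could be trained and swapped independently, which is the decoupling the summary refers to.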
Related events (8)
DOPD: Advantage-aware dual on-policy distillation to address privilege illusion in LLM/VLM training
Researchers introduce DOPD (Dual On-policy Distillation), a knowledge distillation framework that dynamically routes token-level supervision between a privileged teacher and a privileged student policy based on the advantage gap and relative probabilities. The method addresses a failure mode called 'privilege illusion,' where information asymmetry between teacher and student is conflated with a transferable capability gap. Experiments in both LLM and VLM settings show DOPD outperforms vanilla on-policy distillation and related methods, with additional gains in stability, continual learning, and out-of-distribution tasks.
OPCoD: On-Policy Co-Distillation for Mutual LLM Improvement via Peer Feedback
Researchers introduce On-Policy Co-Distillation (OPCoD), a training framework where two LLMs, each stronger in a different domain, iteratively tutor each other using on-policy rollouts and peer feedback. The method uses cognizance-based gating to control when feedback is given and feedback anchoring to ground it in the problem context. On Science Q&A tasks, OPCoD achieves Pareto improvement for both models across all evaluated domain pairs, outperforming one-way distillation and single-model fine-tuning baselines.
Canonical-Context On-Policy Distillation (CCOPD) for Multi-Turn LLM Consistency
This paper identifies 'self-anchored drift' as a key failure mode in multi-turn LLMs: when information is revealed incrementally across turns, models produce unsupported assumptions that distort final answers, even when the total evidence is identical to a single-prompt setting. The authors propose Canonical-Context On-Policy Distillation (CCOPD), which trains a student model on incremental multi-turn conversations to match the output distribution of a frozen teacher conditioned on the full clean prompt. Trained only on math conversations, CCOPD achieves a 32% average relative improvement on multi-turn (RAW-SHARDED) tasks and generalizes zero-shot to five out-of-domain task families while preserving single-prompt performance.
Vision-OPD: On-Policy Self-Distillation for Fine-Grained Visual Understanding in MLLMs
Vision-OPD addresses a 'regional-to-global perception gap' in multimodal LLMs, where models answer fine-grained visual questions more accurately when given cropped evidence regions than full images. The method instantiates a crop-conditioned teacher and full-image-conditioned student from the same MLLM, minimizing token-level divergence along on-policy rollouts to transfer regional perception to the full-image policy. This self-distillation requires no external teacher models, ground-truth labels, reward verifiers, or inference-time tools. Benchmarks show competitive or superior performance against larger open-source, closed-source, and agentic 'Thinking-with-Images' models.
On-policy self-distillation reduces output diversity compared to RL, paper shows
A new arXiv paper analyzes on-policy self-distillation, where a single model serves as both teacher and student, with the teacher conditioned on correct demonstrations. It finds that the approach achieves strong pass@1 accuracy but at the cost of reduced rollout diversity and flattened pass@k curves. The authors trace this to compounding biases: teacher feedback is channeled through the model's own biases, amplifying probability mass on already-dominant modes rather than preserving diversity across equally correct solutions. Theoretical analysis shows the self-distillation policy tilts the base distribution by pointwise conditional mutual information, unlike ideal on-policy RL, which preserves probability ratios among correct rollouts. Empirical results on graph path-finding and science QA benchmarks confirm self-distilled models match RL on average performance but fail on out-of-distribution settings requiring diverse strategies.
d-OPSD: First on-policy self-distillation framework tailored for diffusion LLMs
Researchers introduce d-OPSD, the first on-policy self-distillation (OPSD) framework designed specifically for diffusion large language models (dLLMs). The method addresses a fundamental mismatch between existing autoregressive OPSD approaches and dLLMs' arbitrary-order generation by using suffix conditioning on self-generated answers and step-level rather than token-level divergence supervision. Across four reasoning benchmarks, d-OPSD outperforms RLVR and SFT baselines while requiring only ~10% of the optimization steps of RLVR, suggesting strong sample efficiency gains for dLLM post-training.
ZPPO: Teacher-in-prompt training method outperforms distillation and GRPO for small vision-language models
Researchers introduce Zone of Proximal Policy Optimization (ZPPO), a training method inspired by Vygotsky's zone of proximal development that embeds teacher guidance in prompts rather than policy gradients or logit imitation. On hard questions where student rollouts fail, ZPPO constructs Binary Candidate-included Questions (BCQ) and Negative Candidate-included Questions (NCQ) to help the student discriminate correct from incorrect responses, with a replay buffer that recirculates hard questions until mastered. Evaluated on the Qwen3 family (0.8B–9B) with a 27B teacher across a 31-benchmark suite covering VLM, LLM, and video tasks, ZPPO outperforms both distillation and GRPO baselines, with the largest gains at the smallest model scale. The method addresses a known failure mode of RL training where zero-reward rollouts produce no gradient signal.
DRPO: Smooth divergence regularization replaces hard masking in LLM RL training
A new arXiv preprint proposes Divergence Regularized Policy Optimization (DRPO), a method that replaces the hard trust-region mask used in DPPO with a smooth advantage-weighted quadratic regularizer on policy shift. The approach addresses a known weakness in PPO and GRPO where importance ratios poorly proxy distributional shift in long-tailed vocabularies, and in DPPO where gradient signals are discarded rather than corrected at trust-region boundaries. Experiments across model scales, architectures, and precision settings show improved stability and efficiency in LLM RL post-training.
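To make the contrast between hard masking and smooth regularization concrete, here is a minimal per-token sketch. It is not the preprint's objective: the summary does not give formulas, so the use of the log-probability shift as a stand-in for the divergence measure, the threshold `delta`, the coefficient `beta`, and both function names are assumptions chosen purely for illustration.

```python
import math


def hard_masked_objective(logp_new, logp_old, advantage, delta=0.2):
    """Trust-region style hard mask (illustrative stand-in).

    Tokens whose policy shift exceeds delta contribute nothing, so their
    gradient signal is discarded rather than corrected.
    """
    shift = logp_new - logp_old
    if abs(shift) > delta:
        return 0.0
    return math.exp(shift) * advantage


def smooth_regularized_objective(logp_new, logp_old, advantage, beta=1.0):
    """Smooth alternative (assumed form): importance-weighted advantage
    minus an advantage-weighted quadratic penalty on the policy shift.

    Every token keeps a non-zero contribution; large shifts are penalized
    progressively instead of being cut off at a boundary.
    """
    shift = logp_new - logp_old
    surrogate = math.exp(shift) * advantage
    penalty = beta * abs(advantage) * shift ** 2
    return surrogate - penalty


# Toy comparison at a shift just past the hypothetical trust-region boundary.
masked = hard_masked_objective(-1.0, -1.5, advantage=1.0)
smooth = smooth_regularized_objective(-1.0, -1.5, advantage=1.0)
```

In this toy setup the masked objective drops the token entirely once the shift passes `delta`, while the smooth version still returns a value that shrinks as the shift grows.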

