6 · arXiv cs.CL (Computation and Language) · 14h ago

MOPD: Multi-Teacher On-Policy Distillation for integrating multiple RL-trained capabilities in LLMs

Researchers propose Multi-teacher On-Policy Distillation (MOPD), a post-training paradigm that first trains domain-specialized RL teacher models, then distills them into a student model using on-policy rollouts to eliminate exposure bias. Evaluated on Qwen3-30B-A3B, MOPD outperforms Mix-RL, Cascade RL, Off-Policy Finetune, and Param-Merge baselines while preserving nearly all per-domain capability. The method has been deployed in production for MiMo-V2-Flash, an industrial-scale frontier model, validating its practical applicability. The approach also enables parallel, decoupled development of domain teachers, reducing cross-domain interference in multi-capability post-training.
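As a rough illustration of the idea summarized above, the sketch below samples a rollout from the student itself and scores each position against the teacher for that prompt's domain. The summary does not state the loss, so the per-token reverse KL used here is an assumption (a common choice in on-policy distillation), and the toy policies, the `mopd_style_loss` name, and the domain labels are all hypothetical.

```python
import math
import random

def reverse_kl(student_probs, teacher_probs):
    """KL(student || teacher) for one next-token distribution.
    Assumes the teacher assigns non-zero probability to every token."""
    return sum(p * math.log(p / q)
               for p, q in zip(student_probs, teacher_probs) if p > 0)

def sample_rollout(policy, length, rng):
    """Sample tokens from the student's own policy (this is the on-policy part)."""
    tokens = []
    for _ in range(length):
        probs = policy(tokens)
        tokens.append(rng.choices(range(len(probs)), weights=probs)[0])
    return tokens

def mopd_style_loss(student, teachers, domain, length=4, seed=0):
    """Mean per-token reverse KL on a student rollout (assumed loss form)."""
    rng = random.Random(seed)
    teacher = teachers[domain]  # route the prompt to its domain teacher
    rollout = sample_rollout(student, length, rng)
    losses = []
    for t in range(length):
        prefix = rollout[:t]  # the teacher is queried on states the student visited
        losses.append(reverse_kl(student(prefix), teacher(prefix)))
    return sum(losses) / length

# Hypothetical toy policies over a 3-token vocabulary (context-independent).
student = lambda prefix: [0.5, 0.3, 0.2]
teachers = {
    "math": lambda prefix: [0.7, 0.2, 0.1],  # disagrees with the student
    "code": lambda prefix: [0.5, 0.3, 0.2],  # already matches the student
}

loss_math = mopd_style_loss(student, teachers, "math")
loss_code = mopd_style_loss(student, teachers, "code")
print(loss_math, loss_code)
```

Because the teacher is evaluated on prefixes drawn from the student, the training signal covers the states the student actually reaches, which is the usual argument for why on-policy distillation reduces exposure bias.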

Frontier Model Releases · Alignment and RLHF · Qwen3-30B-A3B · MiMo-V2-Flash · Multi-Teacher On-Policy Distillation · MOPD: Multi-Teacher On-Policy Distillation for Capability Integration in LLM Post-Training

Related events (8)

5 · arXiv · cs.AI · 14h ago

DOPD: Advantage-aware dual on-policy distillation to address privilege illusion in LLM/VLM training

Researchers introduce DOPD (Dual On-policy Distillation), a knowledge distillation framework that dynamically routes token-level supervision between a privileged teacher and privileged student policy based on advantage gap and relative probabilities. The method addresses a failure mode called 'privilege illusion,' where information asymmetry between teacher and student is conflated with a transferable capability gap. Experiments on both LLM and VLM settings show DOPD outperforms vanilla on-policy distillation and related methods, with additional gains on stability, continual learning, and out-of-distribution tasks.

Open Weights Progress · Alignment and RLHF · DOPD

5 · arXiv · cs.CL · 15d ago

OPCoD: On-Policy Co-Distillation for Mutual LLM Improvement via Peer Feedback

Researchers introduce On-Policy Co-Distillation (OPCoD), a training framework in which two LLMs, each stronger in a different domain, iteratively tutor each other using on-policy rollouts and peer feedback. The method uses cognizance-based gating to control when feedback is given and feedback anchoring to ground it in the problem context. On Science Q&A tasks, OPCoD achieves a Pareto improvement for both models across all evaluated domain pairs, outperforming one-way distillation and single-model fine-tuning baselines.

Evaluation and Benchmarking · Alignment and RLHF · On-Policy Co-Distillation · Be My Tutor: On-Policy Co-Distillation for Mutual LLM Improvement via Peer Feedback

6 · arXiv · cs.CL · 1mo ago

Canonical-Context On-Policy Distillation (CCOPD) for Multi-Turn LLM Consistency

This paper identifies 'self-anchored drift' as a key failure mode in multi-turn LLMs: when information is revealed incrementally across turns, models produce unsupported assumptions that distort final answers, even when the total evidence is identical to a single-prompt setting. The authors propose Canonical-Context On-Policy Distillation (CCOPD), which trains a student model on incremental multi-turn conversations to match the output distribution of a frozen teacher conditioned on the full clean prompt. Trained only on math conversations, CCOPD achieves a 32% average relative improvement on multi-turn (RAW-SHARDED) tasks and generalizes zero-shot to five out-of-domain task families while preserving single-prompt performance.

Evaluation and Benchmarking · Agent and Tool Ecosystem · on-policy distillation · multi-turn language models · self-anchored drift

6 · arXiv · cs.CL · 1mo ago

Vision-OPD: On-Policy Self-Distillation for Fine-Grained Visual Understanding in MLLMs

Vision-OPD addresses a 'regional-to-global perception gap' in multimodal LLMs, where models answer fine-grained visual questions more accurately when given cropped evidence regions than full images. The method instantiates a crop-conditioned teacher and full-image-conditioned student from the same MLLM, minimizing token-level divergence along on-policy rollouts to transfer regional perception to the full-image policy. This self-distillation requires no external teacher models, ground-truth labels, reward verifiers, or inference-time tools. Benchmarks show competitive or superior performance against larger open-source, closed-source, and agentic 'Thinking-with-Images' models.

Evaluation and Benchmarking · Agent and Tool Ecosystem · Multimodal Large Language Models · Thinking-with-Images · on-policy self-distillation

6 · arXiv · cs.AI · 5d ago

On-policy self-distillation reduces output diversity compared to RL, paper shows

A new arXiv paper analyzes on-policy self-distillation, where a single model serves as both teacher and student conditioned on correct demonstrations, finding that it achieves strong pass@1 accuracy at the cost of reduced rollout diversity and flattened pass@k curves. The authors trace this to compounding biases: teacher feedback is channeled through the model's own biases, amplifying probability mass on already-dominant modes rather than preserving diversity across equally correct solutions. Theoretical analysis shows that the self-distillation policy tilts the base distribution by pointwise conditional mutual information, unlike ideal on-policy RL, which preserves probability ratios among correct rollouts. Empirical results on graph path-finding and science QA benchmarks confirm that self-distilled models match RL on average performance but fail in out-of-distribution settings requiring diverse strategies.
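To make the pass@1 versus pass@k distinction concrete, the sketch below uses the standard unbiased pass@k estimator (from n samples with c correct, 1 - C(n-c, k) / C(n, k)). The per-problem correct counts are invented for illustration and are not taken from the paper; they only show how two models can share the same pass@1 while one has a flat pass@k curve.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples of which c are correct."""
    if n - c < k:
        return 1.0  # fewer than k incorrect samples: every k-subset has a correct one
    return 1.0 - comb(n - c, k) / comb(n, k)

def mean_pass_at_k(correct_counts, n, k):
    """Average pass@k over problems."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)

N_SAMPLES = 10
# Hypothetical correct counts per problem (4 problems, 10 samples each).
concentrated = [10, 10, 0, 0]  # always right or always wrong: mass on one mode
spread_out = [5, 5, 5, 5]      # right half the time on every problem

for k in (1, 5):
    print(k,
          mean_pass_at_k(concentrated, N_SAMPLES, k),
          mean_pass_at_k(spread_out, N_SAMPLES, k))
```

Both hypothetical models score 0.5 at k = 1, but only the spread-out one improves as k grows, which is the shape of curve the summary describes as flattened for self-distilled models.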

Evaluation and Benchmarking · Alignment and RLHF · On-Policy Self-Distillation with Sampled Demonstrations Reduces Output Diversity

5 · arXiv · cs.CL · 13d ago

d-OPSD: First on-policy self-distillation framework tailored for diffusion LLMs

Researchers introduce d-OPSD, the first on-policy self-distillation (OPSD) framework designed specifically for diffusion large language models (dLLMs). The method addresses a fundamental mismatch between existing autoregressive OPSD approaches and dLLMs' arbitrary-order generation by using suffix conditioning on self-generated answers and step-level rather than token-level divergence supervision. Across four reasoning benchmarks, d-OPSD outperforms RLVR and SFT baselines while requiring only ~10% of the optimization steps of RLVR, suggesting strong sample efficiency gains for dLLM post-training.

Frontier Model Releases · Alignment and RLHF · d-OPSD · Learning from the Self-future: On-policy Self-distillation for dLLMs

6 · arXiv · cs.CL · 13d ago

ZPPO: Teacher-in-prompt training method outperforms distillation and GRPO for small vision-language models

Researchers introduce Zone of Proximal Policy Optimization (ZPPO), a training method inspired by Vygotsky's zone of proximal development that embeds teacher guidance in prompts rather than policy gradients or logit imitation. On hard questions where student rollouts fail, ZPPO constructs Binary Candidate-included Questions (BCQ) and Negative Candidate-included Questions (NCQ) to help the student discriminate correct from incorrect responses, with a replay buffer that recirculates hard questions until mastered. Evaluated on the Qwen3 family (0.8B–9B) with a 27B teacher across a 31-benchmark suite covering VLM, LLM, and video tasks, ZPPO outperforms both distillation and GRPO baselines, with the largest gains at the smallest model scale. The method addresses a known failure mode of RL training where zero-reward rollouts produce no gradient signal.
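The zero-gradient failure mode mentioned at the end of the summary can be seen in a few lines. The sketch below assumes the usual GRPO-style group-normalized advantage, (reward - group mean) / (group std + eps); the function name and the reward values are illustrative, and implementations differ on details such as population versus sample standard deviation.

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-6):
    """Group-normalized advantages for one prompt's rollouts (GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hard question: every rollout fails, so every reward is zero.
all_failed = group_advantages([0.0, 0.0, 0.0, 0.0])
# Easier question: one rollout succeeds.
mixed = group_advantages([1.0, 0.0, 0.0, 0.0])

print(all_failed)  # every advantage is 0.0, so the policy gradient vanishes
print(mixed)       # the successful rollout gets a positive advantage
```

When all rewards in a group are equal, every advantage is zero and the prompt contributes nothing to the update, which is the gap that prompt-level teacher guidance on hard questions is meant to fill.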

Open Weights Progress · Alignment and RLHF · GRPO · Proximal Policy Optimization · Qwen3

5 · arXiv · cs.LG · 21d ago

DRPO: Smooth divergence regularization replaces hard masking in LLM RL training

A new arXiv preprint proposes Divergence Regularized Policy Optimization (DRPO), a method that replaces the hard trust-region mask used in DPPO with a smooth advantage-weighted quadratic regularizer on policy shift. The approach addresses a known weakness in PPO and GRPO where importance ratios poorly proxy distributional shift in long-tailed vocabularies, and in DPPO where gradient signals are discarded rather than corrected at trust-region boundaries. Experiments across model scales, architectures, and precision settings show improved stability and efficiency in LLM RL post-training.

Alignment and RLHF · Divergence Regularized Policy Optimization · GRPO · PPO