MOPD: Multi-Teacher On-Policy Distillation for Capability Integration in LLM Post-Training
MOPD: Multi-Teacher On-Policy Distillation for integrating multiple RL-trained capabilities in LLMs
Researchers propose Multi-Teacher On-Policy Distillation (MOPD), a post-training paradigm with two stages: first, domain-specialized teacher models are trained with RL; then the teachers are distilled into a single student model using on-policy rollouts, which eliminates exposure bias. Evaluated on Qwen3-30B-A3B, MOPD outperforms the Mix-RL, Cascade RL, Off-Policy Finetune, and Param-Merge baselines while preserving nearly all per-domain capability. The method has been deployed in production for MiMo-V2-Flash, an industrial-scale frontier model, which supports its practical applicability. Because development of the domain teachers is decoupled, they can be built in parallel, and cross-domain interference in multi-capability post-training is reduced.
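The distillation stage described above can be illustrated with a toy sketch. This is not the authors' implementation: the summary does not state the loss or how a teacher is chosen, so the per-token reverse KL objective, the routing of each prompt to a single teacher by a domain label, and every name here (`on_policy_distill_loss`, `sample_rollout`, the toy `student`, `math_teacher`, and `code_teacher` functions) are assumptions made for illustration. The only point it tries to capture is that the tokens being scored are sampled from the student itself rather than taken from a fixed dataset.

```python
import math
import random

def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reverse_kl(student_probs, teacher_probs):
    """KL(student || teacher) at one token position (assumed objective)."""
    return sum(p * math.log(p / q)
               for p, q in zip(student_probs, teacher_probs) if p > 0.0)

def sample_rollout(student_logits_fn, prompt, length, rng):
    """Sample tokens from the student, so the visited prefixes are on-policy."""
    tokens = []
    for _ in range(length):
        probs = softmax(student_logits_fn(prompt, tokens))
        tokens.append(rng.choices(range(len(probs)), weights=probs)[0])
    return tokens

def on_policy_distill_loss(student_logits_fn, teachers, prompt, domain,
                           length=4, seed=0):
    """Mean per-token reverse KL on a student rollout.

    Assumption: the prompt carries a domain label, and only the teacher
    registered for that domain scores the rollout.
    """
    rng = random.Random(seed)
    teacher_logits_fn = teachers[domain]
    rollout = sample_rollout(student_logits_fn, prompt, length, rng)
    total = 0.0
    for t in range(length):
        prefix = rollout[:t]
        s = softmax(student_logits_fn(prompt, prefix))
        q = softmax(teacher_logits_fn(prompt, prefix))
        total += reverse_kl(s, q)
    return total / length

# Hypothetical toy models over a 3-token vocabulary.
def student(prompt, prefix):
    return [0.0, 0.0, 0.0]   # uniform: has not learned either domain yet

def math_teacher(prompt, prefix):
    return [2.0, 0.0, 0.0]   # prefers token 0

def code_teacher(prompt, prefix):
    return [0.0, 0.0, 2.0]   # prefers token 2

teachers = {"math": math_teacher, "code": code_teacher}
loss = on_policy_distill_loss(student, teachers, "2+2=?", "math")
```

In a real training loop the loss would be backpropagated through the student's logits. The teachers only score the rollouts, so under these assumptions each one can be trained separately and plugged in later, which is consistent with the decoupled teacher development the summary describes.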