MOPD: Multi-Teacher On-Policy Distillation for Capability Integration in LLM Post-Training

paper · active · provisional · mopd-multi-teacher-on-policy-distillation-for-capability-integration-in-llm-post-training-16780889 · 1 event · first seen 15h ago

Recent events (1)

arXiv · cs.CL · 15h ago

MOPD: Multi-Teacher On-Policy Distillation for integrating multiple RL-trained capabilities in LLMs

Researchers propose Multi-teacher On-Policy Distillation (MOPD), a post-training paradigm with two stages: first, domain-specialized teacher models are trained with RL; then the teachers are distilled into a single student model using on-policy rollouts, that is, sequences sampled from the student itself, which eliminates exposure bias. Evaluated on Qwen3-30B-A3B, MOPD outperforms the Mix-RL, Cascade RL, Off-Policy Finetune, and Param-Merge baselines while preserving nearly all per-domain capability. The method has been deployed in production for MiMo-V2-Flash, an industrial-scale frontier model, which supports its practical applicability. Because the domain teachers are developed in parallel and decoupled from one another, the approach also reduces cross-domain interference in multi-capability post-training.
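The summary does not give the training objective, so the following is only a minimal sketch of what one on-policy distillation step could look like, not the paper's implementation. It assumes a per-token reverse KL divergence, KL(student || teacher), which is a common choice for on-policy distillation but is not confirmed by the source. Every name here (`on_policy_distill_loss`, `toy_student`, the `"math"` and `"code"` domains, the four-token vocabulary) is hypothetical, and the "models" are toy functions that return fixed logits.

```python
import math
import random


def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reverse_kl(student_probs, teacher_probs):
    """KL(student || teacher) at a single token position."""
    return sum(
        p * math.log(p / q)
        for p, q in zip(student_probs, teacher_probs)
        if p > 0
    )


def on_policy_distill_loss(student_fn, teacher_fns, domain, prompt, max_len, rng):
    """Roll out from the student, scoring each step against the domain teacher.

    Assumption: the prompt's domain label selects which teacher supervises it.
    Returns the mean per-token reverse KL and the sampled continuation.
    """
    teacher_fn = teacher_fns[domain]
    tokens = list(prompt)
    per_token = []
    for _ in range(max_len):
        s = softmax(student_fn(tokens))
        t = softmax(teacher_fn(tokens))
        per_token.append(reverse_kl(s, t))
        # On-policy: the next token is sampled from the student, not the teacher,
        # so the student is trained on prefixes it would actually produce.
        next_token = rng.choices(range(len(s)), weights=s, k=1)[0]
        tokens.append(next_token)
    return sum(per_token) / len(per_token), tokens[len(prompt):]


# Toy stand-ins for real models: each maps a token prefix to logits over 4 tokens.
def toy_student(tokens):
    return [0.0, 0.0, 0.0, 0.0]


def toy_math_teacher(tokens):
    return [2.0, 0.0, 0.0, 0.0]


def toy_code_teacher(tokens):
    return [0.0, 0.0, 0.0, 2.0]


teachers = {"math": toy_math_teacher, "code": toy_code_teacher}
rng = random.Random(0)
loss, rollout = on_policy_distill_loss(toy_student, teachers, "math", [1], 5, rng)
print(f"mean reverse KL: {loss:.4f}, rollout: {rollout}")
```

In a real setup the loss would be backpropagated through the student only, with the teachers frozen. Because each teacher is consulted only for its own domain's prompts, teachers can be trained and swapped independently, which matches the decoupled development the summary describes.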