On-Policy Co-Distillation

technique · active · provisional · on-policy-co-distillation-82ee8218 · 1 event · first seen 2d ago

Aliases: On-Policy Co-Distillation

Recent events (1)

5 · arXiv · cs.CL · 2d ago

OPCoD: On-Policy Co-Distillation for Mutual LLM Improvement via Peer Feedback

Researchers introduce On-Policy Co-Distillation (OPCoD), a training framework in which two LLMs, each stronger in a different domain, iteratively tutor each other using on-policy rollouts and peer feedback. The method uses cognizance-based gating to control when feedback is given, and feedback anchoring to ground that feedback in the problem context. On Science Q&A tasks, OPCoD achieves a Pareto improvement for both models across all evaluated domain pairs, outperforming one-way distillation and single-model fine-tuning baselines.
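
The loop described above can be pictured with a small toy sketch. It is not the paper's implementation: the summary does not say how cognizance is computed, what the feedback looks like, or how models are updated. Everything here is a hypothetical stand-in: `ToyModel`, `co_distill_step`, the `threshold` parameter, the reading of cognizance as a per-domain competence score, and the reading of anchoring as bundling the problem text with the feedback.

```python
from dataclasses import dataclass, field


@dataclass
class ToyModel:
    """Stand-in for an LLM; skill maps domain -> competence in [0, 1]."""
    name: str
    skill: dict
    received: list = field(default_factory=list)

    def rollout(self, problem):
        # On-policy: the student produces its own attempt at the problem.
        return f"{self.name}'s attempt at: {problem['text']}"

    def cognizance(self, problem):
        # ASSUMPTION: cognizance is a self-assessed competence score
        # for the problem's domain.
        return self.skill.get(problem["domain"], 0.0)

    def give_feedback(self, problem, attempt):
        # ASSUMPTION: anchoring is modeled as carrying the problem text
        # along with the student's attempt.
        return {"problem": problem["text"], "attempt": attempt, "tutor": self.name}

    def learn(self, problem, feedback, step=0.1):
        # Toy update: nudge competence upward; a real system would
        # take a gradient step on the feedback instead.
        domain = problem["domain"]
        self.skill[domain] = min(1.0, self.skill.get(domain, 0.0) + step)
        self.received.append(feedback)


def co_distill_step(model_a, model_b, problem, threshold=0.5):
    """Run one symmetric round; return names of models that got feedback."""
    tutored = []
    for student, tutor in ((model_a, model_b), (model_b, model_a)):
        attempt = student.rollout(problem)
        # Gate: the tutor speaks only when it judges itself competent.
        if tutor.cognizance(problem) >= threshold:
            student.learn(problem, tutor.give_feedback(problem, attempt))
            tutored.append(student.name)
    return tutored


if __name__ == "__main__":
    a = ToyModel("A", {"physics": 0.9, "biology": 0.3})
    b = ToyModel("B", {"physics": 0.3, "biology": 0.9})
    print(co_distill_step(a, b, {"domain": "physics", "text": "Why is the sky blue?"}))
    print(co_distill_step(a, b, {"domain": "biology", "text": "What do ribosomes do?"}))
```

In this toy setup each model is tutored only in the domain where its peer is the stronger one, which is the mutual, two-way structure the summary contrasts with one-way distillation.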