Be My Tutor: On-Policy Co-Distillation for Mutual LLM Improvement via Peer Feedback

arXiv · cs.CL · 2d ago

OPCoD: On-Policy Co-Distillation for Mutual LLM Improvement via Peer Feedback

Researchers introduce On-Policy Co-Distillation (OPCoD), a training framework in which two LLMs, each stronger in a different domain, iteratively tutor each other using on-policy rollouts and peer feedback. The method uses cognizance-based gating to control when feedback is given and feedback anchoring to ground that feedback in the problem context. On Science Q&A tasks, OPCoD yields a Pareto improvement for both models across all evaluated domain pairs and outperforms one-way distillation and single-model fine-tuning baselines.
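The Pareto-improvement claim can be made concrete with a small check. The sketch below is an illustrative assumption, not the paper's evaluation code: it treats each model's results as a mapping from domain name to accuracy and counts an outcome as a Pareto improvement when no domain regresses and at least one domain strictly improves. The function names `is_pareto_improvement` and `both_models_improve`, the domain names, and the scores are all hypothetical.

```python
from typing import Dict

# Hypothetical representation: domain name -> accuracy in [0, 1].
Scores = Dict[str, float]


def is_pareto_improvement(before: Scores, after: Scores) -> bool:
    """Assumed definition: no domain gets worse and at least one gets strictly better."""
    if before.keys() != after.keys():
        raise ValueError("before/after scores must cover the same domains")
    no_regression = all(after[d] >= before[d] for d in before)
    some_gain = any(after[d] > before[d] for d in before)
    return no_regression and some_gain


def both_models_improve(before_a: Scores, after_a: Scores,
                        before_b: Scores, after_b: Scores) -> bool:
    """True only if each of the two peer models individually sees a Pareto improvement."""
    return (is_pareto_improvement(before_a, after_a)
            and is_pareto_improvement(before_b, after_b))


if __name__ == "__main__":
    # Toy numbers made up for illustration; they are not results from the paper.
    a_before = {"physics": 0.60, "biology": 0.40}
    a_after = {"physics": 0.61, "biology": 0.48}
    print(is_pareto_improvement(a_before, a_after))  # True
```

Requiring the condition for each model separately mirrors the summary's wording that both models improve; a check that only compared averaged scores could hide a regression in one model's stronger domain.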