technique

DOPD

technique · active · provisional · dopd-a6cdb229 · 1 event · first seen 12h ago

Aliases: DOPD

More like this (12)

DPPO · DDPO · d-OPSD · DPO · DDPM · Progressive OPD (POPD) · DPOT · DanceOPD · OPSD · Truncated OPD (TOPD) · DAPO · DoRA

Recent events (1)

5 · arXiv · cs.AI · 12h ago

DOPD: Advantage-aware dual on-policy distillation to address privilege illusion in LLM/VLM training

Researchers introduce DOPD (Dual On-policy Distillation), a knowledge distillation framework that dynamically routes token-level supervision between a privileged teacher and a privileged student policy, based on the advantage gap and relative probabilities. The method addresses a failure mode called 'privilege illusion,' in which the information asymmetry between teacher and student is mistaken for a transferable capability gap. Experiments in both LLM and VLM settings show that DOPD outperforms vanilla on-policy distillation and related methods, with additional gains in stability, continual learning, and out-of-distribution performance.
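
To make the idea of token-level routing concrete, here is a minimal sketch in Python. It is not the paper's rule: the summary only says that routing depends on the advantage gap and relative probabilities, so the function name `route_supervision`, the `gap_threshold` parameter, and the specific decision rule are all assumptions made for illustration.

```python
from typing import List, Sequence


def route_supervision(
    advantage_gap: Sequence[float],
    teacher_prob: Sequence[float],
    student_prob: Sequence[float],
    gap_threshold: float = 0.0,
) -> List[str]:
    """Pick a supervision source per token.

    Hypothetical rule (an assumption, not taken from the paper):
    use the privileged teacher only when the advantage gap exceeds
    `gap_threshold` AND the teacher assigns the sampled token at least
    as much probability as the privileged student policy. Otherwise
    fall back to the privileged student policy.
    """
    if not (len(advantage_gap) == len(teacher_prob) == len(student_prob)):
        raise ValueError("all inputs must have one entry per token")

    routes = []
    for gap, p_teacher, p_student in zip(advantage_gap, teacher_prob, student_prob):
        use_teacher = gap > gap_threshold and p_teacher >= p_student
        routes.append("teacher" if use_teacher else "student")
    return routes


# Toy inputs for three tokens (made-up numbers, for illustration only).
routes = route_supervision(
    advantage_gap=[0.4, -0.1, 0.3],
    teacher_prob=[0.7, 0.6, 0.2],
    student_prob=[0.5, 0.3, 0.6],
)
print(routes)  # ['teacher', 'student', 'student']
```

In this toy rule the third token goes to the student policy even though the advantage gap is positive, because the teacher assigns the sampled token a lower probability. The actual criteria and thresholds used by DOPD should be taken from the paper itself.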

Open Weights Progress · Alignment and RLHF · DOPD