On-Policy Distillation (OPD)

technique · active · provisional · on-policy-distillation-opd--ed538534 · 2 events · first seen 21d ago

Aliases: On-Policy Distillation (OPD)

Recent events (2)

6 · arXiv · cs.CL · 16d ago

Are Full Rollouts Necessary for On-Policy Distillation?

This paper investigates whether full rollouts are required during on-policy distillation (OPD) for training reasoning models, identifying rollout horizon as a key computational bottleneck. The authors propose two strategies: Progressive OPD (POPD), which gradually expands rollout horizon during training, and Truncated OPD (TOPD), which uses permanently truncated rollouts. Experiments on mathematical reasoning show POPD achieves up to 3× training efficiency improvement, while TOPD matches full OPD performance using only 10% of the rollout horizon, yielding significant wall-clock and memory savings.
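The three rollout regimes described above differ only in how many tokens the student is allowed to generate at each training step. The sketch below illustrates that difference. It is not the paper's implementation: the function name `rollout_horizon`, the linear growth schedule, and the `min_fraction` parameter are assumptions chosen for illustration, with `min_fraction=0.1` echoing the 10% figure quoted in the summary.

```python
def rollout_horizon(step, total_steps, max_horizon, mode="full", min_fraction=0.1):
    """Return the maximum number of tokens the student may generate at `step`.

    Hypothetical helper: the linear schedule and all parameter names are
    illustrative assumptions, not taken from the paper.
    """
    if mode == "full":
        # Standard OPD: always roll out to the full horizon.
        return max_horizon

    # Shortest horizon used by the truncated and progressive variants.
    floor = max(1, int(round(max_horizon * min_fraction)))

    if mode == "truncated":
        # TOPD-style: the horizon stays short for the whole run.
        return floor

    if mode == "progressive":
        # POPD-style: grow from the short horizon to the full one.
        # Linear growth is an assumption; other schedules are possible.
        progress = min(1.0, step / max(1, total_steps - 1))
        return int(round(floor + (max_horizon - floor) * progress))

    raise ValueError(f"unknown mode: {mode}")


def total_rollout_budget(total_steps, max_horizon, mode):
    """Upper bound on generated tokens per prompt, summed over training."""
    return sum(
        rollout_horizon(s, total_steps, max_horizon, mode)
        for s in range(total_steps)
    )
```

Under these assumptions, comparing `total_rollout_budget(..., "full")` with the `"progressive"` and `"truncated"` modes gives a rough sense of where the generation savings come from, although actual wall-clock and memory gains depend on the training setup.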

6 · arXiv · cs.CL · 21d ago

Pair-In, Pair-Out (PIPO): Unified Latent Compression and Multi-Token Prediction for Efficient LLM Inference

PIPO is a new inference efficiency framework that unifies input-side latent compression with output-side multi-token prediction (MTP) by treating them as mirror operations: a compressor folds two input tokens into one latent, while an MTP head unfolds one hidden state into an additional output token. To avoid the expensive verifier pass typically required by speculative decoding, PIPO trains a lightweight confidence head using On-Policy Distillation (OPD), which naturally aligns with rejection-sampling criteria. Experiments on Qwen3.5-4B and 9B backbones across AIME 2025, GPQA-Diamond, LiveCodeBench v6, and LongBench v2 show up to 2.64× first-token-latency speedup and +7.15 pass@4 improvement over regular decoding.
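The role of the confidence head can be pictured as a gate on the extra token proposed by the MTP head: if the confidence is high enough, the extra token is emitted without a separate verifier pass. The sketch below shows only that gating idea. The function names, the fixed `threshold`, and the fallback of emitting just the primary token are assumptions for illustration and are not taken from the PIPO paper.

```python
def decode_step(primary_token, extra_token, confidence, threshold=0.9):
    """Emit one or two tokens for a single decoding step.

    Hypothetical gate: `confidence` stands in for the output of a
    confidence head, and `threshold` is an illustrative value.
    """
    if confidence >= threshold:
        # Confident enough: accept the MTP head's extra token as well.
        return [primary_token, extra_token]
    # Otherwise keep only the regular next-token prediction.
    return [primary_token]


def decode(steps, threshold=0.9):
    """Run the gate over a sequence of (primary, extra, confidence) tuples."""
    output = []
    for primary, extra, confidence in steps:
        output.extend(decode_step(primary, extra, confidence, threshold))
    return output


if __name__ == "__main__":
    # Toy inputs: the first step is confident, the second is not.
    toy_steps = [("a", "b", 0.95), ("c", "d", 0.50)]
    print(decode(toy_steps))
```

In this toy setting, each accepted extra token saves one decoding step, which is the intuition behind the latency gains the summary reports.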