
Learning from the Self-future: On-policy Self-distillation for dLLMs

paper · active · provisional · learning-from-the-self-future-on-policy-self-distillation-for-dllms-ad5db843 · 1 event · first seen 3h ago


Recent events (1)

5 · arXiv · cs.CL · 3h ago

d-OPSD: First on-policy self-distillation framework tailored for diffusion LLMs

Researchers introduce d-OPSD, the first on-policy self-distillation (OPSD) framework designed specifically for diffusion large language models (dLLMs). The method addresses a fundamental mismatch between existing autoregressive OPSD approaches and dLLMs' arbitrary-order generation by using suffix conditioning on self-generated answers and step-level rather than token-level divergence supervision. Across four reasoning benchmarks, d-OPSD outperforms RLVR and SFT baselines while requiring only ~10% of the optimization steps of RLVR, suggesting strong sample efficiency gains for dLLM post-training.
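The summary contrasts step-level with token-level divergence supervision but does not say how either is computed. The sketch below is one possible reading, not the paper's implementation. It assumes "step-level" means grouping sequence positions by the denoising step at which they were unmasked and averaging the divergence over steps rather than over individual tokens. The names `kl`, `token_level_loss`, `step_level_loss`, and `step_of_position` are hypothetical, and the teacher and student distributions are hand-written toy lists rather than model outputs.

```python
import math


def kl(p, q, eps=1e-12):
    """KL(p || q) for two categorical distributions given as lists of probabilities."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def token_level_loss(teacher, student):
    """Average the per-position KL over every position in the sequence."""
    return sum(kl(t, s) for t, s in zip(teacher, student)) / len(teacher)


def step_level_loss(teacher, student, step_of_position):
    """Assumed step-level variant: sum KL over the positions unmasked at each
    denoising step, then average over the steps (not over the tokens)."""
    per_step = {}
    for pos, step in enumerate(step_of_position):
        per_step[step] = per_step.get(step, 0.0) + kl(teacher[pos], student[pos])
    return sum(per_step.values()) / len(per_step)


# Toy example: 3 positions, vocabulary of size 2.
teacher = [[0.9, 0.1], [0.5, 0.5], [0.2, 0.8]]
student = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]

# Hypothetical unmasking order: positions 0 and 1 at step 0, position 2 at step 1.
step_of_position = [0, 0, 1]

print("token-level:", token_level_loss(teacher, student))
print("step-level: ", step_level_loss(teacher, student, step_of_position))
```

Under this reading, the two losses coincide when every position is unmasked at its own step, and differ once several positions share a step, which is the arbitrary-order case the summary attributes to dLLMs.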