Entity · paper

Learning from the Self-future: On-policy Self-distillation for dLLMs

paper · active · learning-from-the-self-future-on-policy-self-distillation-for-dllms-ad5db843 · 1 event · first seen Jun 17, 2026

Aliases: Learning from the Self-future: On-policy Self-distillation for dLLMs

Co-occurring entities

d-OPSD

More like this (12)

on-policy self-distillation
Neuron-Aware Data Selection for Annotation-Free LLM Self-Distillation
On-Policy Self-Distillation with Sampled Demonstrations Reduces Output Diversity
Reinforcement Learning with Metacognitive Feedback Elicits Faithful Uncertainty Expression in LLMs
on-policy distillation
Dense Supervision, Sparse Updates: On the Sparsity and Geometry of On-Policy Distillation
Be My Tutor: On-Policy Co-Distillation for Mutual LLM Improvement via Peer Feedback
Future Confidence Distillation in Large Language Models
Weak-to-Strong Generalization via Direct On-Policy Distillation
Teaching LLMs to Self-Evolve: Cultivating Core Meta-Skills with Reinforcement Learning
On-Policy Distillation for LLM Safety: A Routing Approach to Template-Robust Realignment
ExpRL: Exploratory RL for LLM Mid-Training

Recent events (1)

5 · arXiv · cs.CL · Jun 17, 2026

d-OPSD: First on-policy self-distillation framework tailored for diffusion LLMs

Researchers introduce d-OPSD, the first on-policy self-distillation (OPSD) framework designed specifically for diffusion large language models (dLLMs). Existing OPSD approaches assume autoregressive, left-to-right generation, which does not match the arbitrary-order generation of dLLMs. d-OPSD addresses this mismatch in two ways: it conditions on the model's self-generated answer as a suffix, and it applies divergence supervision at the level of denoising steps rather than individual tokens. Across four reasoning benchmarks, d-OPSD outperforms RLVR and SFT baselines while requiring only about 10% of the optimization steps of RLVR, suggesting strong sample efficiency gains for dLLM post-training.
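The summary above does not spell out the training objective, so the following is only a toy illustration of what "step-level rather than token-level" divergence could look like, not the paper's method. It assumes that the teacher is the same model with extra suffix context, that each position carries the index of the denoising step at which it was unmasked, and that the divergence is KL(teacher || student) averaged within each step. The function names, the KL direction, and the averaging are all hypothetical choices.

```python
import math
import random


def log_softmax(logits):
    """Numerically stable log-softmax of one logit vector."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def kl_divergence(teacher_logits, student_logits):
    """KL(teacher || student) for a single position (assumed direction)."""
    t = log_softmax(teacher_logits)
    s = log_softmax(student_logits)
    return sum(math.exp(tl) * (tl - sl) for tl, sl in zip(t, s))


def step_level_divergence(teacher, student, step_ids):
    """Group per-position KL by denoising step and average within each step.

    step_ids[i] is the (hypothetical) step at which position i was unmasked.
    Returns a dict mapping step index to the mean KL over its positions.
    """
    by_step = {}
    for t_row, s_row, step in zip(teacher, student, step_ids):
        by_step.setdefault(step, []).append(kl_divergence(t_row, s_row))
    return {step: sum(vals) / len(vals) for step, vals in by_step.items()}


# Toy data: 6 positions, vocabulary of 5, unmasked in arbitrary order.
rng = random.Random(0)
seq_len, vocab = 6, 5
student = [[rng.gauss(0, 1) for _ in range(vocab)] for _ in range(seq_len)]
# Stand-in for the suffix-conditioned teacher: student logits plus a perturbation.
teacher = [[x + 0.5 * rng.gauss(0, 1) for x in row] for row in student]
step_ids = [2, 0, 1, 0, 2, 1]

per_step = step_level_divergence(teacher, student, step_ids)
loss = sum(per_step.values()) / len(per_step)  # one scalar over all steps
```

Grouping by step rather than by position means the supervision follows the order in which the model actually revealed tokens, which is the property the summary attributes to d-OPSD; a token-level variant would instead sum `kl_divergence` over positions in a fixed left-to-right order.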

Frontier Model Releases · Alignment and RLHF · d-OPSD · Learning from the Self-future: On-policy Self-distillation for dLLMs