Entity · technique

on-policy distillation

technique · active · on-policy-distillation-2ce3730a · 8 events · first seen May 29, 2026

Aliases: on-policy distillation, Direct On-Policy Distillation, On-Policy Delta Distillation, Relay On-Policy Distillation

Recent events (8)

5 · arXiv · cs.AI · 2d ago

Relay-OPD addresses prefix failure in on-policy distillation via teacher-student trajectory handoff

Researchers introduce Relay On-Policy Distillation (Relay-OPD), a training method that addresses 'prefix failure' in on-policy knowledge distillation, where a student model's early reasoning errors compound throughout a trajectory. The approach detects points where teacher and student continuations diverge asymmetrically, then briefly hands generation to the teacher to produce a corrective 'relay leg' before the student resumes. Evaluated on eight mathematical reasoning benchmarks using Qwen3-4B-Instruct-2507 as teacher and Qwen3-0.6B/1.7B as students, Relay-OPD outperforms standard OPD by 5.73% and the strongest baseline, FastOPD, by 1.49% on average for the 1.7B model, while also reducing training trajectory length by over 50%.

Open Weights Progress · Alignment and RLHF · on-policy distillation · Qwen3.5-0.8B · Pass the Baton: Trajectory-Relayed On-Policy Distillation · +3 more

6 · arXiv · cs.CL · 3d ago

Unified study of multi-turn long-horizon planning across pre-training and post-training stages via agentic distillation

A new arXiv preprint introduces a controlled multi-turn environment to systematically study how long-horizon planning ability is acquired, shaped, and integrated in foundation model agents across three stages: pre-training data design, post-training via GRPO and on-policy distillation (OPD), and multi-teacher on-policy distillation (MOPD). Key findings include that explicit world model construction via chain-of-thought state transition modeling improves generalization, suboptimal trajectories severely degrade performance over long horizons, and OPD outperforms GRPO in low-quality and long-horizon settings. The multi-teacher distillation analysis reveals that compatible planning patterns enable cross-environment generalization while conflicting patterns cause interference.

Frontier Model Releases · Agent and Tool Ecosystem · The Physics of Multi-Turn Long-Horizon Planning: From Pre-training to Post-training via Single- and Multi-Teacher On-Policy Agentic Distillation · on-policy distillation · GRPO · +2 more

5 · arXiv · cs.CL · Jul 17, 2026

On-Policy Delta Distillation improves reasoning transfer via teacher-base model difference signal

Researchers from NAVER AI introduce On-Policy Delta Distillation (OPD²), a new post-training method that replaces direct imitation of a teacher model's output distribution with a 'delta signal' — the difference between the teacher and its pre-instruction-tuning base model. This delta signal isolates the reasoning capability changes induced by instruction tuning, providing a more targeted supervision signal for student models. Experiments across math, science, and code-reasoning benchmarks show OPD² consistently outperforms conventional on-policy distillation with shorter post-training periods.

Frontier Model Releases · Alignment and RLHF · on-policy distillation · NAVER AI

6 · arXiv · cs.CL · Jul 7, 2026

Direct On-Policy Distillation transfers RL policy shifts from weak to strong models

Researchers propose Direct-OPD (Direct On-Policy Distillation), a method for transferring the policy shift induced by reinforcement learning on a small model to a larger target model, bypassing the need to run expensive RL rollouts on the stronger model. The approach uses the log-ratio between a post-RL teacher and its pre-RL reference as a dense implicit reward signal applied to the student's own on-policy states. Empirically, Direct-OPD improves Qwen3-1.7B from 48.3% to 62.4% on AIME 2024 in 4 hours on 8 A100 GPUs, outperforming step-matched direct RL. The method addresses a key scalability bottleneck in post-training as frontier models grow larger.

Training Infrastructure · Frontier Model Releases · on-policy distillation · AIME 2026 · Weak-to-Strong Generalization via Direct On-Policy Distillation · +5 more
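The log-ratio signal described in the Direct-OPD entry above can be illustrated with a small sketch. This is not the paper's implementation: the function name `log_ratio_rewards` and the toy numbers are hypothetical, and the sketch assumes that per-token log-probabilities of the student's own sampled tokens have already been computed by both the post-RL teacher and its pre-RL reference.

```python
from typing import List


def log_ratio_rewards(teacher_logps: List[float],
                      reference_logps: List[float]) -> List[float]:
    """Return one implicit reward per token: log pi_teacher - log pi_reference.

    Both inputs hold log-probabilities of the tokens the *student* sampled,
    scored by the post-RL teacher and by its pre-RL reference (assumed given).
    """
    if len(teacher_logps) != len(reference_logps):
        raise ValueError("both models must score the same token sequence")
    return [t - r for t, r in zip(teacher_logps, reference_logps)]


# Toy values (hypothetical): scores for a 3-token student rollout.
teacher = [-0.5, -2.0, -1.0]
reference = [-1.5, -1.0, -1.0]
rewards = log_ratio_rewards(teacher, reference)
print(rewards)
```

A positive entry marks a token that RL made more likely under the teacher, a negative entry one that RL made less likely, and zero means RL left that token's probability unchanged. Because every token gets its own value, the signal is dense rather than a single end-of-trajectory score.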

6 · arXiv · cs.AI · Jul 1, 2026

GR2: Generative Reasoning Re-Ranker applies RL with verifiable rewards to industrial-scale LLM recommendation

GR2 (Generative Reasoning Re-Ranker) is a new framework that applies reinforcement learning with verifiable rewards to the re-ranking stage of industrial recommendation systems, a step largely overlooked by prior LLM-based recommendation research. The system combines semantic ID mid-training, reasoning-trace distillation from a stronger teacher model, and purpose-built RL rewards, plus a context compressor and On-Policy Distillation to make it viable at scale. Deployed on industrial traffic, GR2 achieves +18.7% R@1 and +9.6% N@3 over legacy baselines. The paper also identifies a critical reward-hacking failure mode where LLMs exploit position bias or preserve input order, motivating conditional verifiable rewards.

Inference Economics · Agent and Tool Ecosystem · GR2 · on-policy distillation · GR2 Technical Report

5 · arXiv · cs.LG · Jun 12, 2026

Analysis of on-policy distillation reveals sparse, geometrically structured parameter updates

A new arXiv paper analyzes on-policy distillation (OPD) — a post-training method combining on-policy student trajectories with dense teacher supervision — across language and vision-language model pairs. The authors find that OPD updates are coordinate-sparse and distributed across layers (FFN-heavy), and that training only the discovered sparse subnetwork recovers near-full performance. Geometrically, updates are numerically full-rank but spectrally concentrated, falling disproportionately on near-zero weight coordinates, suggesting OPD retains distinct geometric signatures rather than behaving like ordinary dense parameter rewriting.

Evaluation and Benchmarking · Alignment and RLHF · on-policy distillation · AdamW · Dense Supervision, Sparse Updates: On the Sparsity and Geometry of On-Policy Distillation
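To make "coordinate-sparse" concrete, one simple way to quantify it is the fraction of parameter coordinates that a training run leaves effectively unchanged. The sketch below is an assumption for illustration only: the helper `update_sparsity`, its tolerance `tol`, and the toy weights are hypothetical, and the paper may define sparsity differently.

```python
from typing import List


def update_sparsity(before: List[float], after: List[float],
                    tol: float = 1e-8) -> float:
    """Fraction of coordinates whose value changed by at most `tol`."""
    if len(before) != len(after):
        raise ValueError("parameter vectors must have the same length")
    if not before:
        raise ValueError("parameter vectors must be non-empty")
    unchanged = sum(1 for b, a in zip(before, after) if abs(a - b) <= tol)
    return unchanged / len(before)


# Toy flattened weights (hypothetical): only the first coordinate moves,
# and it starts at zero.
w_before = [0.0, 0.25, -0.5, 0.0, 1.0]
w_after = [0.125, 0.25, -0.5, 0.0, 1.0]
print(update_sparsity(w_before, w_after))
```

In this toy case four of five coordinates are untouched, so the update is 80% sparse. Applied per layer, the same measurement would show where in the network the changed coordinates concentrate.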

6 · arXiv · cs.AI · Jun 2, 2026

SafeSteer: Localized On-Policy Distillation for Efficient Safety Alignment

SafeSteer proposes a safety alignment method that targets only 'safety tokens' in the output distribution rather than applying global fine-tuning, arguing that safety features are inherently sparse. It constructs a safety teacher via activation steering, then restricts a reverse KL penalty to selected safety tokens during training. The approach achieves strong safety performance across seven benchmarks with minimal capability degradation, requiring only 100 harmful samples, less than 1% of the data used by prior baselines.

Evaluation and Benchmarking · AI Safety Research · on-policy distillation · SafeSteer · alignment tax · +3 more
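The idea of restricting a reverse KL penalty to selected tokens can be sketched as below. This reflects one possible reading of the summary, namely that the penalty sums over a chosen subset of vocabulary entries at a single position. The function `restricted_reverse_kl`, the token ids, and the toy distributions are hypothetical, and the selection procedure used by SafeSteer is not shown.

```python
import math
from typing import Iterable, List


def restricted_reverse_kl(student_probs: List[float],
                          teacher_probs: List[float],
                          token_ids: Iterable[int]) -> float:
    """Sum p_s(v) * (log p_s(v) - log p_t(v)) over the selected ids only.

    Assumes teacher_probs[v] > 0 for every selected id.
    """
    total = 0.0
    for v in token_ids:
        p_s, p_t = student_probs[v], teacher_probs[v]
        if p_s > 0.0:  # the term 0 * log 0 is taken as 0
            total += p_s * (math.log(p_s) - math.log(p_t))
    return total


# Toy next-token distributions over a 3-token vocabulary (hypothetical).
student = [0.5, 0.25, 0.25]
teacher = [0.25, 0.5, 0.25]

full = restricted_reverse_kl(student, teacher, range(3))  # ordinary reverse KL
only_token_0 = restricted_reverse_kl(student, teacher, [0])
print(full, only_token_0)
```

Summing over the whole vocabulary recovers the ordinary reverse KL, KL(student || teacher). A partial sum is no longer a true divergence and can be negative, so a restricted penalty of this form only constrains the student on the selected entries and leaves the rest of the distribution free.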

6 · arXiv · cs.CL · May 29, 2026

Canonical-Context On-Policy Distillation (CCOPD) for Multi-Turn LLM Consistency

This paper identifies 'self-anchored drift' as a key failure mode in multi-turn LLMs: when information is revealed incrementally across turns, models produce unsupported assumptions that distort final answers, even when the total evidence is identical to a single-prompt setting. The authors propose Canonical-Context On-Policy Distillation (CCOPD), which trains a student model on incremental multi-turn conversations to match the output distribution of a frozen teacher conditioned on the full clean prompt. Trained only on math conversations, CCOPD achieves a 32% average relative improvement on multi-turn (RAW-SHARDED) tasks and generalizes zero-shot to five out-of-domain task families while preserving single-prompt performance.

Evaluation and Benchmarking · Agent and Tool Ecosystem · on-policy distillation · multi-turn language models · self-anchored drift · +2 more