6arXiv cs.AI (Artificial Intelligence)·24d ago

Skill-Conditioned Gated Self-Distillation (SGSD) for LLM Reasoning

SGSD is a new on-policy self-distillation method for LLM reasoning that replaces trusted privileged information (e.g., reference answers) with an experience-derived skill bank of skill-mistake pairs. It constructs a multi-teacher pool, validates each teacher's contribution via a verifier, and applies a gated objective to distill informative disagreements while suppressing noisy signals. On Qwen3-1.7B, SGSD outperforms GRPO by 6.2% and answer-conditioned OPSD by 1.7% on average across AIME24, AIME25, and HMMT25. The method relaxes the assumption of trusted privileged information, making self-distillation more practical under weaker supervision.

Frontier Model Releases Evaluation and Benchmarking Agent and Tool Ecosystem Alignment and RLHF OPSD AIME24 SGSD GRPO AIME25 Skill-Conditioned Gated Self-Distillation (SGSD)Qwen3-1.7B HMMT25

Related guides (4)

Frontier Model ReleasesTopic guide

Frontier Model Releases: The Race From Language to Action

Read asBeginner In-depth

GRPOConcept

GRPO: The Lightweight RL Trick Behind Today's Reasoning Models

Read asBeginner In-depth

Agent and Tool EcosystemTopic guide

Agent and Tool Ecosystem: How the Infrastructure Layer Around LLMs Is Consolidating

Read asIn-depth

Alignment and RLHFTopic guide

Alignment and RLHF: From Human Feedback to Scalable Post-Training

Read asIn-depth

Related events (8)

6arXiv · cs.CL·1mo ago·source ↗

Self-Policy Distillation via Capability-Selective Subspace Projection

This paper introduces Self-Policy Distillation (SPD), a self-distillation method for LLMs that requires no external signals such as correctness filters or reward models. SPD extracts a low-rank capability subspace from the model's own gradients on correctness-defining tokens, then projects KV activations into this subspace during self-generation to isolate task-relevant signal from stylistic noise. Experiments across code generation, math reasoning, and QA show up to 13% improvement over prior signal-free self-distillation methods and 15% better out-of-domain generalization.

Frontier Model Releases Evaluation and Benchmarking large language models key-value (KV) activation projection low-rank subspace projection +2 more

5arXiv · cs.AI·3d ago·source ↗

Rubric-Conditioned Self-Distillation: structured feedback for reasoning model post-training

A new arXiv preprint proposes Rubric-Conditioned Self-Distillation (RCSD), a post-training framework that replaces scalar reward signals and noisy chain-of-thought annotations with structured rubrics for fine-grained credit assignment. The method conditions a teacher model on criterion-level rubrics to provide token-level guidance on the student's own sampled trajectories, avoiding reliance on a single reference rationale. Evaluated on science reasoning benchmarks, RCSD outperforms GRPO by 1.0 points and OPSD by 0.9 points on average.

Evaluation and Benchmarking Alignment and RLHF OPSD GRPO Rubric-Conditioned Self-Distillation

5arXiv · cs.CL·4d ago·source ↗

d-OPSD: First on-policy self-distillation framework tailored for diffusion LLMs

Researchers introduce d-OPSD, the first on-policy self-distillation (OPSD) framework designed specifically for diffusion large language models (dLLMs). The method addresses a fundamental mismatch between existing autoregressive OPSD approaches and dLLMs' arbitrary-order generation by using suffix conditioning on self-generated answers and step-level rather than token-level divergence supervision. Across four reasoning benchmarks, d-OPSD outperforms RLVR and SFT baselines while requiring only ~10% of the optimization steps of RLVR, suggesting strong sample efficiency gains for dLLM post-training.

Frontier Model Releases Alignment and RLHF d-OPSD Learning from the Self-future: On-policy Self-distillation for dLLMs

5arXiv · cs.AI·11d ago·source ↗

Step-aligned critique outperforms GRPO and reference-solution conditioning in self-distillation

A new arXiv paper investigates context design for self-distillation of language models, comparing binary reward (GRPO), reference solutions, and step-by-step critiques aligned to the solver's reasoning trace. Step-aligned critique yields the largest gains, outperforming GRPO by 16.11 points and reference-solution conditioning by 5.27 points on Avg@12. Per-token advantage analysis shows that step-aligned feedback targets only failing tokens, avoiding unnecessary pressure on already-correct reasoning steps. The findings suggest structural alignment between feedback and the model's reasoning trace is a key driver of self-distillation effectiveness.

Evaluation and Benchmarking Alignment and RLHF GRPO The Role of Feedback Alignment in Self-Distillation

6arXiv · cs.LG·5d ago·source ↗

ExpRL: RL-based mid-training using human QA data as reward scaffolds for LLM reasoning

ExpRL proposes an automated approach to LLM mid-training that replaces manually curated reasoning traces with large corpora of human-written QA data used as reward scaffolds rather than imitation targets. Reference solutions are hidden from the policy and used only to construct problem-specific grading rubrics, enabling dense process-level rewards that reinforce partial progress and intermediate reasoning steps. On challenging math reasoning benchmarks, ExpRL outperforms SFT, sparse-reward GRPO, and self-distillation as an RL initialization strategy, with additional mixed-domain experiments suggesting broader applicability.

Evaluation and Benchmarking Alignment and RLHF ExpRL: Exploratory RL for LLM Mid-Training GRPO ExpRL

6arXiv · cs.CL·23d ago·source ↗

Canonical-Context On-Policy Distillation (CCOPD) for Multi-Turn LLM Consistency

This paper identifies 'self-anchored drift' as a key failure mode in multi-turn LLMs: when information is revealed incrementally across turns, models produce unsupported assumptions that distort final answers, even when the total evidence is identical to a single-prompt setting. The authors propose Canonical-Context On-Policy Distillation (CCOPD), which trains a student model on incremental multi-turn conversations to match the output distribution of a frozen teacher conditioned on the full clean prompt. Trained only on math conversations, CCOPD achieves a 32% average relative improvement on multi-turn (RAW-SHARDED) tasks and generalizes zero-shot to five out-of-domain task families while preserving single-prompt performance.

Evaluation and Benchmarking Agent and Tool Ecosystem on-policy distillation multi-turn language models self-anchored drift +2 more

6arXiv · cs.CL·20d ago·source ↗

Question-Answering as Hidden State Probing for Test-Time Reasoning Intervention

This paper proposes using question-asking as an inference-time intervention to surface information about an LLM's hidden state during chain-of-thought reasoning. The authors train a probe on a student model's hidden states before and after question generation, finding it predictive of final answer correctness even before the teacher responds—suggesting self-diagnosis during question generation carries meaningful signal. They frame question-asking as a sequential decision problem with a gating policy, but find a gap between detection and recovery: interventions are as likely to harm correct trajectories as to fix incorrect ones. The results have implications for the limits of LLM self-refinement under uncertainty.

Evaluation and Benchmarking Agent and Tool Ecosystem student-teacher prompting Chain-of-Thought Reasoning inference-time intervention +4 more

5arXiv · cs.CL·6d ago·source ↗

OPCoD: On-Policy Co-Distillation for Mutual LLM Improvement via Peer Feedback

Researchers introduce On-Policy Co-Distillation (OPCoD), a training framework where two LLMs, each stronger in a different domain, iteratively tutor each other using on-policy rollouts and peer feedback. The method uses cognizance-based gating to control when feedback is given and feedback anchoring to ground it in the problem context. On Science Q&A tasks, OPCoD achieves Pareto improvement for both models across all evaluated domain pairs, outperforming one-way distillation and single-model fine-tuning baselines.

Evaluation and Benchmarking Alignment and RLHF On-Policy Co-Distillation Be My Tutor: On-Policy Co-Distillation for Mutual LLM Improvement via Peer Feedback