Rubric-Conditioned Self-Distillation: structured feedback for reasoning model post-training
A new arXiv preprint proposes Rubric-Conditioned Self-Distillation (RCSD), a post-training framework that replaces scalar reward signals and noisy chain-of-thought annotations with structured rubrics for fine-grained credit assignment. The method conditions a teacher model on criterion-level rubrics to provide token-level guidance on the student's own sampled trajectories, avoiding reliance on a single reference rationale. Evaluated on science reasoning benchmarks, RCSD outperforms GRPO by 1.0 points and OPSD by 0.9 points on average.
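The summary above does not state the exact objective RCSD optimizes. As a rough illustration of what "token-level guidance on the student's own sampled trajectories" can look like, the sketch below assumes a generic per-token log-ratio between a rubric-conditioned teacher and the student. The function name and all numbers are hypothetical and are not taken from the paper.

```python
import math

def token_level_signal(student_logps, teacher_logps):
    """Per-token log-ratio (teacher minus student) on a student-sampled sequence.

    Assumption: both lists hold log-probabilities of the SAME sampled tokens,
    scored once by the student and once by a teacher that also sees the rubric.
    Positive values mean the teacher finds the token more likely than the
    student did; negative values mean less likely.
    """
    if len(student_logps) != len(teacher_logps):
        raise ValueError("log-prob sequences must align token by token")
    return [t - s for s, t in zip(student_logps, teacher_logps)]

# Hypothetical log-probs for a 4-token student sample.
student = [math.log(0.5), math.log(0.2), math.log(0.9), math.log(0.4)]
teacher = [math.log(0.5), math.log(0.6), math.log(0.9), math.log(0.1)]

signal = token_level_signal(student, teacher)
for i, s in enumerate(signal):
    print(f"token {i}: {s:+.3f}")
```

In this toy case, tokens 0 and 2 receive no pressure, token 1 is encouraged, and token 3 is discouraged, which is the kind of fine-grained credit assignment a single scalar reward cannot express.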
Related events (8)
Step-aligned critique outperforms GRPO and reference-solution conditioning in self-distillation
A new arXiv paper investigates context design for self-distillation of language models, comparing binary reward (GRPO), reference solutions, and step-by-step critiques aligned to the solver's reasoning trace. Step-aligned critique yields the largest gains, outperforming GRPO by 16.11 points and reference-solution conditioning by 5.27 points on Avg@12. Per-token advantage analysis shows that step-aligned feedback targets only failing tokens, avoiding unnecessary pressure on already-correct reasoning steps. The findings suggest structural alignment between feedback and the model's reasoning trace is a key driver of self-distillation effectiveness.
POW3R: Policy-Aware Rubric Rewards for More Efficient RLVR Training
This paper identifies a failure mode in rubric-based reinforcement learning with verifiable rewards (RLVR): static aggregation of criterion weights conflates human-assigned importance with current optimization utility, causing many criteria to be either already saturated or unreachable. The authors introduce POW3R, a framework that dynamically reweights criterion-level rewards during training using rollout-level contrast to emphasize criteria that currently differentiate policy outputs. Across three base policies and two datasets (multimodal and text-only), POW3R wins 24 of 30 comparisons on rubric reward and strict completion metrics, and reaches equivalent performance in 2.5–4× fewer training steps than vanilla GRPO with rubric rewards.
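The idea of emphasizing criteria that currently differentiate policy outputs can be sketched as follows. This is not POW3R's actual formula, which the summary does not give. It assumes, purely for illustration, that "rollout-level contrast" is measured as the variance of a criterion's score across rollouts of the same prompt. The criterion names and weights are hypothetical.

```python
from statistics import pvariance

def reweight_criteria(base_weights, rollout_scores):
    """Scale each human-assigned weight by how much the criterion varies across rollouts.

    base_weights:   {criterion: human-assigned importance}
    rollout_scores: {criterion: [score for each rollout of one prompt]}
    Assumption: variance stands in for "contrast". A criterion that every
    rollout passes (saturated) or every rollout fails (unreachable) has zero
    variance and therefore contributes no learning signal.
    """
    raw = {c: base_weights[c] * pvariance(rollout_scores[c]) for c in base_weights}
    total = sum(raw.values())
    if total == 0:
        # No criterion separates the rollouts; fall back to normalized base weights.
        base_total = sum(base_weights.values())
        return {c: w / base_total for c, w in base_weights.items()}
    return {c: v / total for c, v in raw.items()}

base = {"cites_units": 1.0, "correct_final": 2.0, "novel_proof": 3.0}
scores = {
    "cites_units":   [1.0, 1.0, 1.0, 1.0],  # saturated: every rollout passes
    "correct_final": [1.0, 0.0, 1.0, 0.0],  # differentiates rollouts
    "novel_proof":   [0.0, 0.0, 0.0, 0.0],  # unreachable: every rollout fails
}
weights = reweight_criteria(base, scores)
print(weights)
```

Here the criterion with the highest human-assigned weight ends up with zero effective weight, which illustrates the gap between assigned importance and current optimization utility that the paper describes.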
EDIT framework trains more rubric-faithful LLM graders via internal-state diagnostics
Researchers introduce Evidence-Diagnosed Intervention Training (EDIT), a two-phase framework for improving LLM-based rubric grading. The first phase (EDIT-SFT) identifies problematic reasoning steps using posterior belief signals and input-grounding scores, then revises only those steps with rubric checklists; the second phase (EDIT-RL) uses belief-guided reward shaping to penalize harmful belief drifts during RL. Experiments on two real-world multi-subject grading benchmarks show consistent improvements over SFT and RL baselines on both in-domain and out-of-domain splits.
Skill-Conditioned Gated Self-Distillation (SGSD) for LLM Reasoning
SGSD is a new on-policy self-distillation method for LLM reasoning that replaces trusted privileged information (e.g., reference answers) with an experience-derived skill bank of skill-mistake pairs. It constructs a multi-teacher pool, validates each teacher's contribution via a verifier, and applies a gated objective to distill informative disagreements while suppressing noisy signals. On Qwen3-1.7B, SGSD outperforms GRPO by 6.2% and answer-conditioned OPSD by 1.7% on average across AIME24, AIME25, and HMMT25. The method relaxes the assumption of trusted privileged information, making self-distillation more practical under weaker supervision.
AMARIS: Memory-Augmented Rubric Improvement System for Rubric-Based Reinforcement Learning
AMARIS introduces a persistent evaluation memory system to improve rubric-based reward shaping in LLM fine-tuning via reinforcement learning. Unlike prior adaptive rubric methods that discard evaluation diagnostics after each step, AMARIS accumulates step-level summaries and retrieves relevant historical context via both static (recent steps) and dynamic (semantic similarity) retrieval to inform rubric updates. The system runs asynchronously alongside the RL training loop with approximately 5% time overhead. Experiments across closed and open-ended domains show consistent improvements over baselines, with ablations confirming that combining both retrieval modes yields the strongest results.
ExpRL: RL-based mid-training using human QA data as reward scaffolds for LLM reasoning
ExpRL proposes an automated approach to LLM mid-training that replaces manually curated reasoning traces with large corpora of human-written QA data used as reward scaffolds rather than imitation targets. Reference solutions are hidden from the policy and used only to construct problem-specific grading rubrics, enabling dense process-level rewards that reinforce partial progress and intermediate reasoning steps. On challenging math reasoning benchmarks, ExpRL outperforms SFT, sparse-reward GRPO, and self-distillation as an RL initialization strategy, with additional mixed-domain experiments suggesting broader applicability.
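To show how a rubric can turn an all-or-nothing outcome into a dense reward that credits partial progress, here is a minimal sketch. It assumes a weighted fraction of satisfied criteria. The summary does not specify ExpRL's actual scoring rule, and the criterion names are hypothetical.

```python
def rubric_reward(weights, met):
    """Weighted fraction of rubric criteria that a response satisfies.

    weights: {criterion: non-negative weight}
    met:     {criterion: True/False}, e.g. as judged by a grader.
    Criteria missing from `met` count as not satisfied.
    """
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("rubric needs positive total weight")
    earned = sum(w for name, w in weights.items() if met.get(name, False))
    return earned / total

rubric = {"sets_up_equation": 1.0, "applies_identity": 1.0, "final_answer": 2.0}
judged = {"sets_up_equation": True, "applies_identity": True, "final_answer": False}

reward = rubric_reward(rubric, judged)
print(reward)
```

A sparse outcome reward would score this response 0 because the final answer is wrong, whereas the rubric-based reward still credits the two correct intermediate steps.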
QUBRIC: Co-designing queries and rubrics for RL beyond verifiable rewards
QUBRIC is a framework that jointly optimizes queries and rubrics for reinforcement learning in settings where rewards are not strictly verifiable. The approach uses teacher-derived key points to rewrite open-ended queries into evaluable scenarios, applies contrastive rubric generation to capture teacher-policy gaps, and filters for learnability before GRPO training. Trained only on instruction-following data, QUBRIC achieves a +5.5 point gain on ArenaHard over an SFT baseline and transfers to legal, moral, and narrative reasoning benchmarks (+6.3 points average), suggesting rubric-based RL can complement RLVR in non-verifiable domains.
RA-RFT: Retrieval-Augmented Reinforcement Fine-Tuning teaches LLMs to reason by analogy
Researchers propose Retrieval-Augmented Reinforcement Fine-Tuning (RA-RFT), a post-training framework that trains a retriever to rank contexts by expected reasoning benefit rather than semantic similarity, then fine-tunes a policy model via reinforcement learning using retrieved analogous demonstrations. The key insight is that reasoning-relevant retrieval surfaces complementary solution strategies rather than superficially similar problems. On mathematical reasoning benchmarks, RA-RFT improves AIME 2025 average@32 accuracy by 7.1 and 2.8 points over GRPO for Qwen3-1.7B and Qwen3-4B respectively, suggesting reasoning-aware retrieval is orthogonal to reward design and training curriculum improvements.
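Several summaries above report Avg@12 or average@32 accuracy. A minimal sketch of that kind of metric follows, assuming the common definition: sample k answers per problem, take the fraction that are correct, then average over problems. The data is hypothetical, and individual papers may define the metric slightly differently.

```python
def average_at_k(correct_flags_per_problem):
    """Mean over problems of the fraction of sampled answers that are correct.

    correct_flags_per_problem: one list of booleans per problem,
    with one entry per sampled answer (k entries per problem).
    """
    if not correct_flags_per_problem:
        raise ValueError("need at least one problem")
    per_problem = [sum(flags) / len(flags) for flags in correct_flags_per_problem]
    return sum(per_problem) / len(per_problem)

# Two hypothetical problems, k = 4 samples each.
results = [
    [True, False, True, True],    # 3 of 4 correct
    [False, False, True, False],  # 1 of 4 correct
]
score = average_at_k(results)
print(score)
```

Unlike pass@k, which counts a problem as solved if any sample is correct, this average rewards consistency across samples.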


