The Role of Feedback Alignment in Self-Distillation

paper · active · provisional · the-role-of-feedback-alignment-in-self-distillation-76918a3a · 1 event · first seen 7d ago

Recent events (1)

5 · arXiv · cs.AI · 7d ago

Step-aligned critique outperforms GRPO and reference-solution conditioning in self-distillation

A new arXiv paper investigates context design for self-distillation of language models, comparing binary reward (GRPO), reference solutions, and step-by-step critiques aligned to the solver's reasoning trace. Step-aligned critique yields the largest gains, outperforming GRPO by 16.11 points and reference-solution conditioning by 5.27 points on Avg@12. Per-token advantage analysis shows that step-aligned feedback targets only failing tokens, avoiding unnecessary pressure on already-correct reasoning steps. The findings suggest structural alignment between feedback and the model's reasoning trace is a key driver of self-distillation effectiveness.
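For readers unfamiliar with the metric, the sketch below shows one common reading of Avg@k: sample k solutions per problem, score each as correct or incorrect, and average the per-problem accuracy. The summary does not define Avg@12, so this definition is an assumption, and the function name `avg_at_k` and the toy data are hypothetical illustrations rather than the paper's evaluation code.

```python
from statistics import mean


def avg_at_k(results: list[list[bool]]) -> float:
    """Return the mean per-problem accuracy over k samples, in percent.

    `results[i][j]` is True when sample j for problem i was judged correct.
    Assumption: Avg@k is the average of per-problem accuracy across k samples.
    """
    if not results:
        raise ValueError("results must contain at least one problem")
    per_problem = [
        mean(1.0 if ok else 0.0 for ok in samples) for samples in results
    ]
    return 100.0 * mean(per_problem)


# Toy data (hypothetical): two problems, two samples each.
toy = [[True, False], [True, True]]
score = avg_at_k(toy)
```

Under this reading, a gap of 16.11 points on Avg@12 would be a difference between two such percentages, each computed with twelve samples per problem.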