Reinforcement Learning from Rich Feedback with Distributional DAgger

Co-occurring entities

DAgger · DistIL · Reinforcement Learning with Verifiable Rewards

Recent events (1)

arXiv · cs.AI · 13d ago

DistIL: Distributional DAgger for RL from Rich Feedback beyond single-bit rewards

A new arXiv preprint introduces DistIL, a distributional variant of the DAgger imitation learning algorithm. It is designed to exploit rich feedback signals such as execution traces, tool outputs, and expert corrections, rather than the single-bit correctness reward used in standard Reinforcement Learning with Verifiable Rewards (RLVR). The method trains with a forward cross-entropy objective, which the preprint reports gives monotonic policy improvement guarantees; the reverse KL and Jensen-Shannon divergence objectives used in prior self-distillation approaches lack them. Empirically, DistIL is reported to outperform RLVR and self-distillation baselines on scientific reasoning, coding, and hard math benchmarks.
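
To make the three objectives named in the summary concrete, here is a minimal sketch that evaluates forward cross-entropy, reverse KL, and Jensen-Shannon divergence on a toy three-way categorical distribution. It is not the preprint's implementation: the `teacher` and `student` vectors and all function names are hypothetical and chosen only for illustration. The one fact it relies on is the standard identity H(p, q) = H(p) + KL(p || q), so minimizing forward cross-entropy with respect to the student is equivalent to minimizing forward KL.

```python
import math

def kl(p, q):
    """KL(p || q) for two categorical distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def forward_cross_entropy(teacher, student):
    """H(teacher, student) = -sum_x teacher(x) * log student(x)."""
    return -sum(t * math.log(s) for t, s in zip(teacher, student) if t > 0)

def reverse_kl(teacher, student):
    """KL(student || teacher): the expectation is taken under the student."""
    return kl(student, teacher)

def jensen_shannon(teacher, student):
    """Symmetric divergence built from the 50/50 mixture of both distributions."""
    mix = [(t + s) / 2 for t, s in zip(teacher, student)]
    return 0.5 * kl(teacher, mix) + 0.5 * kl(student, mix)

# Hypothetical distributions over three candidate next tokens (assumed values).
teacher = [0.45, 0.45, 0.10]
student = [0.90, 0.05, 0.05]

print("forward CE :", forward_cross_entropy(teacher, student))
print("reverse KL :", reverse_kl(teacher, student))
print("JS         :", jensen_shannon(teacher, student))
```

In the forward objective the weights come from the teacher, so the student is penalized wherever it assigns low probability to something the teacher considers likely. In reverse KL the weights come from the student itself.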

Frontier Model Releases · Alignment and RLHF · DAgger · DistIL · Reinforcement Learning with Verifiable Rewards