reinforcement learning from verifier feedback
Recent events (1)
Self-Trained Verification (STV) Unlocks Training- and Test-Time Self-Improvement for Reasoning Models
This paper introduces Self-Trained Verification (STV), a method that trains a verifier to imitate a more informed version of itself, using reference solutions as the supervision signal. It targets the core bottleneck in both test-time verification-refinement loops and self-training pipelines. At test time, STV roughly doubles accuracy on hard math and achieves a 14x lift on scientific reasoning tasks. At training time, the authors combine STV with RL in a procedure called Verifier-in-the-Loop (ViL) training, which yields a further 33% gain in pass@1 over an already RL-converged generator; standalone pass@1 climbs 30% (relative) past standard RL convergence. The work argues that verification quality, not generation, is the primary bottleneck for scaling reasoning on hard problems.
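For readers unfamiliar with the term, a test-time verification-refinement loop can be sketched roughly as follows. This is a generic illustration and not the paper's algorithm: the function names (`verify_refine`, `toy_generate`, `toy_verify`), the feedback format, and the `max_rounds` parameter are all assumptions made for the example, and the toy verifier is an exact checker standing in for a learned one.

```python
from typing import Callable, Optional, Tuple


def verify_refine(
    problem: str,
    generate: Callable[[str, Optional[str]], str],
    verify: Callable[[str, str], Tuple[bool, str]],
    max_rounds: int = 4,
) -> Tuple[str, bool, int]:
    """Generic verify-then-refine loop (hypothetical sketch, not STV itself).

    Returns (final_answer, accepted, rounds_used).
    """
    feedback: Optional[str] = None
    answer = ""
    for round_idx in range(1, max_rounds + 1):
        # The generator sees the verifier's last critique, if any.
        answer = generate(problem, feedback)
        accepted, feedback = verify(problem, answer)
        if accepted:
            return answer, True, round_idx
    # Budget exhausted: return the last attempt, marked as not accepted.
    return answer, False, max_rounds


def toy_generate(problem: str, feedback: Optional[str]) -> str:
    """Toy generator: deliberately off by one until it receives feedback."""
    a, b = map(int, problem.split("+"))
    guess = a + b + 1 if feedback is None else a + b
    return str(guess)


def toy_verify(problem: str, answer: str) -> Tuple[bool, str]:
    """Toy verifier: an exact checker standing in for a learned verifier."""
    a, b = map(int, problem.split("+"))
    ok = int(answer) == a + b
    return ok, ("" if ok else "sum is off; recheck the addition")


if __name__ == "__main__":
    print(verify_refine("2+3", toy_generate, toy_verify))
```

In this sketch the loop can only improve on the first attempt to the extent that `verify` rejects wrong answers and returns usable feedback, which is the dependency the summary's closing claim about verification quality points to.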