Entity · technique

reinforcement learning from verifier feedback

technique · active · reinforcement-learning-from-verifier-feedback-e2a06f43 · 1 event · first seen May 29, 2026

Aliases: reinforcement learning from verifier feedback

Co-occurring entities

self-training · Verifier-in-the-Loop Training (ViL) · Self-Trained Verification (STV) · verification-refinement loop

Recent events (1)

7 · arXiv · cs.CL · May 29, 2026

Self-Trained Verification (STV) Unlocks Training- and Test-Time Self-Improvement for Reasoning Models

This paper introduces Self-Trained Verification (STV), a method that trains a verifier to imitate a more informed version of itself, using reference solutions as the supervision signal. The authors present verification as the core bottleneck in both test-time verification-refinement loops and self-training pipelines, and STV targets that bottleneck. At test time, STV roughly doubles accuracy on hard math and achieves a 14x lift on scientific reasoning tasks. At training time, the authors combine STV with RL in a procedure called Verifier-in-the-Loop (ViL) training, yielding a further 33% gain in pass@1 over a generator that had already converged under RL, while standalone pass@1 rises 30% (relative) beyond the point where standard RL converges. The work argues that verification quality, rather than generation quality, is the primary bottleneck for scaling reasoning on hard problems.
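For readers unfamiliar with the term, a test-time verification-refinement loop can be sketched generically as below. This is an illustration of the general pattern only, not the paper's STV or ViL procedure: the names `verify_refine`, `toy_generate`, and `toy_verify` are hypothetical, and the toy generator and verifier are arithmetic stand-ins for what would be model calls.

```python
from typing import Callable, Optional, Tuple

# Hypothetical signatures: a generator takes (problem, feedback) and returns a
# candidate; a verifier takes (problem, candidate) and returns (accepted, feedback).
Generator = Callable[[str, Optional[str]], str]
Verifier = Callable[[str, str], Tuple[bool, str]]


def verify_refine(
    problem: str,
    generate: Generator,
    verify: Verifier,
    max_rounds: int = 4,
) -> Tuple[str, bool, int]:
    """Alternate generation and verification until accepted or out of rounds.

    Returns (last candidate, whether it was accepted, rounds used).
    """
    if max_rounds < 1:
        raise ValueError("max_rounds must be at least 1")
    feedback: Optional[str] = None
    candidate = ""
    for round_idx in range(1, max_rounds + 1):
        candidate = generate(problem, feedback)
        accepted, feedback = verify(problem, candidate)
        if accepted:
            return candidate, True, round_idx
    return candidate, False, max_rounds


def toy_generate(problem: str, feedback: Optional[str]) -> str:
    # Toy stand-in for a model: the first attempt is deliberately off by ten,
    # and any verifier feedback triggers a corrected attempt.
    a, b = (int(t) for t in problem.split("*"))
    return str(a * b - 10) if feedback is None else str(a * b)


def toy_verify(problem: str, candidate: str) -> Tuple[bool, str]:
    # Toy stand-in for a verifier: recompute the product and compare.
    a, b = (int(t) for t in problem.split("*"))
    ok = int(candidate) == a * b
    return ok, "" if ok else "product does not match a recomputation"


if __name__ == "__main__":
    print(verify_refine("17 * 23", toy_generate, toy_verify))
```

In this sketch the loop is only as good as `verify`: a verifier that accepts wrong candidates or rejects correct ones caps what refinement can achieve, which is the bottleneck the summary attributes to the paper.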

Frontier Model Releases · Evaluation and Benchmarking · self-training · Verifier-in-the-Loop Training (ViL) · Self-Trained Verification (STV) · +4 more