Reinforcement Learning with Verifiable Rewards

technique · active · reinforcement-learning-with-verifiable-rewards-cc30a5c2 · 8 events · first seen 1mo ago

Aliases: Reinforcement Learning with Verifiable Rewards, Reinforcement Learning with Verifiable Rewards (RLVR), Reinforcement Learning from Verifiable Rewards, RLVR (Reinforcement Learning from Verifiable Rewards), RLVR (Reinforcement Learning with Verifiable Rewards)

Recent events (8)

6 · arXiv · cs.CL · 1mo ago

DelTA: Discriminative Token Credit Assignment for RLVR Training

DelTA introduces a discriminative token credit assignment method for reinforcement learning from verifiable rewards (RLVR) that addresses the problem of high-frequency formatting tokens dominating policy gradient updates. The method estimates per-token coefficients that amplify side-specific gradient directions and downweight shared or weakly discriminative ones, making the effective update direction more contrastive. On seven mathematical benchmarks, DelTA outperforms same-scale baselines by 3.26 and 2.62 average points on Qwen3-8B-Base and Qwen3-14B-Base, respectively, with additional gains on code generation tasks.

7 · arXiv · cs.CL · 1mo ago

RELEX: Extrapolating LLM RLVR Training via Rank-1 Parameter Trajectories

This paper demonstrates that RLVR weight-update trajectories are extremely low-rank and near-linearly predictable, with a rank-1 approximation capturing most downstream performance gains. The authors propose RELEX, a compute-efficient method that observes a short training window, estimates the rank-1 subspace, and extrapolates future checkpoints via linear regression, with no additional training required. Evaluated on Qwen2.5-Math-1.5B, Qwen3-4B-Base, and Qwen3-8B-Base, RELEX matches or exceeds full RLVR performance using as little as 15% of the training steps, and can extrapolate up to 10–20× beyond the observed prefix. The authors attribute the method's effectiveness to a denoising effect: the rank-1 projection discards stochastic optimization noise.
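The observe, project, extrapolate loop described above can be sketched in a few lines. This is an illustrative sketch only, not the authors' implementation: the function names (`top_direction`, `extrapolate_rank1`), the choice to measure updates relative to the first checkpoint, and the use of power iteration plus ordinary least squares are all assumptions made for the example, and parameters are flattened into plain Python lists.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def top_direction(rows, iters=100):
    """Dominant right singular vector of `rows`, found by power iteration."""
    # Start from the largest update so the initial guess is never orthogonal
    # to the dominant direction.
    start = max(rows, key=lambda r: dot(r, r))
    norm = math.sqrt(dot(start, start))
    if norm == 0.0:
        raise ValueError("all checkpoints are identical")
    v = [x / norm for x in start]
    for _ in range(iters):
        proj = [dot(r, v) for r in rows]  # M v
        w = [sum(p * r[j] for p, r in zip(proj, rows)) for j in range(len(v))]  # M^T (M v)
        norm = math.sqrt(dot(w, w))
        v = [x / norm for x in w]
    return v


def extrapolate_rank1(checkpoints, steps, target_step):
    """Predict the parameters at `target_step` from a short observed prefix."""
    if len(checkpoints) < 3 or len(checkpoints) != len(steps):
        raise ValueError("need at least three checkpoints with matching steps")
    theta0 = checkpoints[0]
    # Assumption: updates are measured relative to the first checkpoint.
    deltas = [[c - b for c, b in zip(ckpt, theta0)] for ckpt in checkpoints[1:]]
    v = top_direction(deltas)
    coeffs = [dot(d, v) for d in deltas]  # position along the rank-1 direction
    ts = steps[1:]
    # Ordinary least squares fit: coeff ~ a * step + b.
    mean_t = sum(ts) / len(ts)
    mean_c = sum(coeffs) / len(coeffs)
    var_t = sum((t - mean_t) ** 2 for t in ts)
    a = sum((t - mean_t) * (c - mean_c) for t, c in zip(ts, coeffs)) / var_t
    b = mean_c - a * mean_t
    scale = a * target_step + b
    return [p + scale * x for p, x in zip(theta0, v)]
```

Any component of the observed updates that lies outside the single direction `v` is dropped by the projection, which is the mechanism the summary calls a denoising effect.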

6 · arXiv · cs.AI · 1mo ago

POW3R: Policy-Aware Rubric Rewards for More Efficient RLVR Training

This paper identifies a failure mode in rubric-based reinforcement learning with verifiable rewards (RLVR): static aggregation of criterion weights conflates human-assigned importance with current optimization utility, causing many criteria to be either already saturated or unreachable. The authors introduce POW3R, a framework that dynamically reweights criterion-level rewards during training using rollout-level contrast to emphasize criteria that currently differentiate policy outputs. Across three base policies and two datasets (multimodal and text-only), POW3R wins 24 of 30 comparisons on rubric reward and strict completion metrics, and reaches equivalent performance in 2.5–4× fewer training steps than vanilla GRPO with rubric rewards.
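To make the idea of policy-aware reweighting concrete, here is a minimal sketch. The summary does not give POW3R's actual contrast measure, so this example assumes one: each criterion's human-assigned weight is scaled by the variance of its scores across the rollouts in a group, so saturated or unreachable criteria (zero variance) receive zero weight. The names `contrast_weights` and `rubric_rewards` are hypothetical.

```python
from statistics import pvariance


def contrast_weights(scores, base_weights, eps=1e-8):
    """Reweight rubric criteria by how much they separate the current rollouts.

    scores[i][k] is the score of rollout i on criterion k, in [0, 1].
    base_weights[k] is the human-assigned importance of criterion k.
    """
    raw = []
    for k, base in enumerate(base_weights):
        column = [row[k] for row in scores]
        # Assumed contrast measure: population variance across the group.
        raw.append(base * pvariance(column))
    total = sum(raw)
    if total <= eps:
        # No criterion distinguishes the rollouts: fall back to static weights.
        norm = sum(base_weights)
        return [w / norm for w in base_weights]
    return [r / total for r in raw]


def rubric_rewards(scores, weights):
    """Weighted rubric reward for each rollout."""
    return [sum(w * s for w, s in zip(weights, row)) for row in scores]
```

With static weights, a criterion that every rollout already satisfies adds the same constant to every reward and contributes nothing to a group-relative advantage; the reweighting above moves that weight onto criteria that still differ between rollouts.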

6 · arXiv · cs.CL · 19d ago

LongTraceRL: Reinforcement Learning for Long-Context Reasoning via Search Agent Trajectories and Rubric Rewards

LongTraceRL is a new RL training framework for improving long-context reasoning in LLMs, addressing limitations of existing RLVR methods. It constructs challenging training data using multi-hop questions from knowledge graph random walks and tiered distractors derived from search agent trajectories (high-confusability: read but uncited; low-confusability: seen but unopened). A rubric reward provides entity-level process supervision along reasoning chains, applied only to correct responses to prevent reward hacking. Experiments across three LLMs (4B–30B parameters) on five long-context benchmarks show consistent improvements over strong baselines.
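The gating of the rubric reward on correctness can be illustrated with a short sketch. This is an assumption-laden example rather than the paper's reward: the function name `gated_reward`, the set-overlap scoring of entities, and the `rubric_scale` parameter are all hypothetical.

```python
def gated_reward(is_correct, cited_entities, rubric_entities, rubric_scale=0.5):
    """Outcome reward plus an entity-level rubric bonus, paid only when correct.

    cited_entities: entities mentioned in the model's reasoning chain.
    rubric_entities: entities the rubric expects along the reasoning chain.
    """
    outcome = 1.0 if is_correct else 0.0
    required = set(rubric_entities)
    if not is_correct or not required:
        # Wrong answers earn no rubric credit, so listing the expected
        # entities without solving the task cannot raise the reward.
        return outcome
    coverage = len(set(cited_entities) & required) / len(required)
    return outcome + rubric_scale * coverage
```

For example, a correct response covering one of two rubric entities scores 1.25 under these defaults, while an incorrect response that names both entities scores 0.0.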

4 · Hugging Face Blog · 1mo ago

Ecom-RLVE: Adaptive Verifiable Environments for E-Commerce Conversational Agents

Hugging Face published a blog post introducing Ecom-RLVE, a framework for training e-commerce conversational agents using reinforcement learning with verifiable environments. The approach creates adaptive environments that verify agent actions and outcomes in e-commerce contexts and use those checks as RL training signals. It applies the RLVR (Reinforcement Learning with Verifiable Rewards) paradigm to a specific commercial domain.

6 · arXiv · cs.CL · 29d ago

Two is better than one: A Collapse-free Multi-Reward RLIF Training Framework

This paper proposes a multi-reward reinforcement learning from internal feedback (RLIF) framework that decomposes training signals into an answer-level reward via cluster voting and a completion-level reward via token-wise self-certainty. To address reward hacking and entropy collapse common in single-reward RLIF, the authors introduce GDPO-based normalization and KL-Cov regularization targeting low-entropy token distributions. Evaluated on mathematical reasoning and code-generation benchmarks, the method achieves stability and performance approaching supervised RLVR methods without requiring external ground-truth supervision. The work advances scalable unsupervised RL training for LLM reasoning.
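As a rough illustration of an answer-level reward that needs no ground truth, the sketch below rewards each sampled answer for agreeing with the most common answer in its group. It is a simplification under stated assumptions: the paper's cluster voting may group answers differently, and here clusters are formed by exact match after whitespace and case normalization. The name `vote_rewards` is hypothetical, and the completion-level self-certainty reward is not shown.

```python
from collections import Counter


def vote_rewards(answers):
    """Reward 1.0 for answers in the largest cluster, 0.0 otherwise.

    Assumption: two answers belong to the same cluster when they are equal
    after stripping whitespace and lowercasing.
    """
    normalized = [a.strip().lower() for a in answers]
    majority, _ = Counter(normalized).most_common(1)[0]
    return [1.0 if a == majority else 0.0 for a in normalized]
```

A reward of this kind can be gamed if the policy collapses onto a single confident but wrong answer, which is the failure mode the summary says the second reward and the regularization are meant to counter.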

6 · arXiv · cs.AI · 16d ago

DistIL: Distributional DAgger for RL from Rich Feedback beyond single-bit rewards

A new arXiv preprint introduces DistIL, a distributional variant of the DAgger imitation learning algorithm designed to exploit rich feedback signals (execution traces, tool outputs, expert corrections) rather than the single-bit correctness reward used in standard RLVR. The method uses a forward cross-entropy objective that provides monotonic policy improvement guarantees, unlike reverse KL or Jensen-Shannon divergence objectives used in prior self-distillation approaches. Empirically, DistIL outperforms RLVR and self-distillation baselines on scientific reasoning, coding, and hard math benchmarks.

6 · arXiv · cs.CL · 19d ago

Are Full Rollouts Necessary for On-Policy Distillation?

This paper investigates whether full rollouts are required during on-policy distillation (OPD) for training reasoning models, identifying rollout horizon as a key computational bottleneck. The authors propose two strategies: Progressive OPD (POPD), which gradually expands rollout horizon during training, and Truncated OPD (TOPD), which uses permanently truncated rollouts. Experiments on mathematical reasoning show POPD achieves up to 3× training efficiency improvement, while TOPD matches full OPD performance using only 10% of the rollout horizon, yielding significant wall-clock and memory savings.
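A progressively expanding rollout horizon can be expressed as a simple schedule. The sketch below assumes a linear ramp, which the summary does not specify; the function name `rollout_horizon` and its parameters are hypothetical.

```python
def rollout_horizon(step, total_steps, min_tokens, max_tokens):
    """Maximum rollout length (in tokens) to use at a given training step.

    Assumption: the horizon grows linearly from min_tokens at step 0 to
    max_tokens at the final step.
    """
    if total_steps <= 1:
        return max_tokens
    frac = min(max(step / (total_steps - 1), 0.0), 1.0)
    return int(round(min_tokens + frac * (max_tokens - min_tokens)))
```

A permanently truncated variant in the spirit of TOPD would instead return a fixed fraction of the full horizon at every step, for example `int(0.1 * max_tokens)`.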