6arXiv cs.CL (Computation and Language)·2d ago

STARE: Token-level advantage reweighting to prevent entropy collapse in GRPO-style RL training

Researchers introduce STARE, a method addressing policy entropy collapse in GRPO-style reinforcement learning from verifiable rewards (RLVR) for LLM post-training. Through first-order gradient analysis, they identify a token-level credit assignment mismatch and propose selectively reweighting advantages for entropy-critical tokens using batch-internal surprisal quantiles plus a closed-loop entropy gate. Evaluated across 1.5B–32B models on short/long chain-of-thought and multi-turn tool use tasks, STARE outperforms DAPO and other baselines by 4–8% on AIME24/25 while sustaining stable training over thousands of steps.

Frontier Model Releases Alignment and RLHF DAPO AIME 2026 GRPO STARE AIME 2025

Related guides (3)

Frontier Model ReleasesTopic guide

Frontier Model Releases: The Race From Language to Action

Read asBeginner In-depth

GRPOConcept

GRPO: The Lightweight RL Trick Behind Today's Reasoning Models

Read asBeginner In-depth

Alignment and RLHFTopic guide

Alignment and RLHF: Teaching AI Models to Behave

Read asBeginner In-depth

Related events (8)

5arXiv · cs.AI·17d ago·source ↗

VEPO: Vision-anchored token selection improves RL for visual reasoning

A new arXiv paper identifies a failure mode of entropy-based credit assignment in multimodal reinforcement learning: vision-sensitive tokens with naturally low entropy are systematically ignored, causing the mechanism to collapse in visual reasoning tasks. The authors propose VEPO (Vision-Entropy token-selection for Policy Optimization), which couples visual sensitivity with token entropy via a multiplicative scheme to redirect gradient credit toward tokens that are both visually grounded and semantically informative. VEPO outperforms entropy-only baselines by 2.28 points at 7B scale and 3.15 points at 3B scale on visual reasoning benchmarks.

Alignment and RLHF Multimodal Progress VEPO Entropy Is Not Enough: Unlocking Effective Reinforcement Learning for Visual Reasoning via Vision-Anchored Token Selection

4arXiv · cs.CL·11d ago·source ↗

N-GRPO: Semantic Neighbor Mixing for Improved Policy Optimization in LLM Reasoning

A new arXiv preprint introduces N-GRPO, an exploration strategy for the GRPO reinforcement learning framework that improves solution diversity during rollout by mixing embeddings of anchor tokens with their nearest semantic neighbors rather than using token-level sampling or random noise. The method is evaluated on DeepSeek-R1-Distill-Qwen models of various sizes and shows consistent improvements on math reasoning benchmarks plus out-of-distribution generalization. The work targets a known limitation in RLHF-style training: redundant rollout trajectories that reduce effective learning signal.

Evaluation and Benchmarking Alignment and RLHF N-GRPO DeepSeek-R1-Distill-Qwen Semantic Neighbor Mixing +1 more

6arXiv · cs.CL·29d ago·source ↗

Two is better than one: A Collapse-free Multi-Reward RLIF Training Framework

This paper proposes a multi-reward reinforcement learning from internal feedback (RLIF) framework that decomposes training signals into an answer-level reward via cluster voting and a completion-level reward via token-wise self-certainty. To address reward hacking and entropy collapse common in single-reward RLIF, the authors introduce GDPO-based normalization and KL-Cov regularization targeting low-entropy token distributions. Evaluated on mathematical reasoning and code-generation benchmarks, the method achieves stability and performance approaching supervised RLVR methods without requiring external ground-truth supervision. The work advances scalable unsupervised RL training for LLM reasoning.

AI Safety Research Alignment and RLHF KL-Cov regularization token-wise self-certainty cluster voting reward +3 more

6arXiv · cs.CL·1mo ago·source ↗

DelTA: Discriminative Token Credit Assignment for RLVR Training

DelTA introduces a discriminative token credit assignment method for reinforcement learning from verifiable rewards (RLVR) that addresses the problem of high-frequency formatting tokens dominating policy gradient updates. The method estimates per-token coefficients to amplify side-specific gradient directions and downweight shared or weakly discriminative ones, making the effective update direction more contrastive. On seven mathematical benchmarks, DelTA outperforms same-scale baselines by 3.26 and 2.62 average points on Qwen3-8B-Base and Qwen3-14B-Base respectively, with additional gains on code generation tasks.

Frontier Model Releases Evaluation and Benchmarking DelTA Qwen3-8B-Base policy gradient +5 more

5arXiv · cs.CL·1mo ago·source ↗

LamPO: Lambda-Style Policy Optimization with Pairwise Decomposed Advantage for Reasoning LMs

LamPO proposes a new RLVR training objective that replaces GRPO's scalar group-relative advantages with a Pairwise Decomposed Advantage, aggregating pairwise reward gaps within response groups and weighting comparisons by confidence-aware log-probability differences. The method retains a critic-free, clipped-update PPO-style structure and optionally adds a ROUGE-L-based dense auxiliary reward to reduce sparsity. Experiments on AIME24, AIME25, MATH-500, and GPQA-Diamond using Qwen3-1.7B, Qwen3-4B, and Phi-4-mini show consistent improvements over GRPO and other RLVR variants with more stable training dynamics.

Frontier Model Releases Evaluation and Benchmarking RLVR ROUGE-L AIME24 +10 more

5arXiv · cs.CL·11d ago·source ↗

GGRO: Gradient-Guided Reward Optimization for inference-time LLM alignment

Researchers introduce Gradient-Guided Reward Optimization (GGRO), an inference-time alignment method that uses gradient signals from a reward model to inject 'nudging tokens' at high-uncertainty decoding steps, rather than relying on sampling-intensive re-ranking approaches like Best-of-N. The method monitors token-level entropy to detect distribution drift and steers generation trajectories directly, claiming improved robustness to reward hacking with minimal computational overhead. Experiments show gains across safety, helpfulness, and reasoning benchmarks compared to standard inference-time alignment baselines.

Inference Economics Alignment and RLHF Best-of-N Sampling Gradient-Guided Reward Optimization

6arXiv · cs.AI·1mo ago·source ↗

POW3R: Policy-Aware Rubric Rewards for More Efficient RLVR Training

This paper identifies a failure mode in rubric-based reinforcement learning with verifiable rewards (RLVR): static aggregation of criterion weights conflates human-assigned importance with current optimization utility, causing many criteria to be either already saturated or unreachable. The authors introduce POW3R, a framework that dynamically reweights criterion-level rewards during training using rollout-level contrast to emphasize criteria that currently differentiate policy outputs. Across three base policies and two datasets (multimodal and text-only), POW3R wins 24 of 30 comparisons on rubric reward and strict completion metrics, and reaches equivalent performance in 2.5–4× fewer training steps than vanilla GRPO with rubric rewards.

Evaluation and Benchmarking Alignment and RLHF rubric-based rewards GRPO POW3R +2 more

5arXiv · cs.LG·11d ago·source ↗

DRPO: Smooth divergence regularization replaces hard masking in LLM RL training

A new arXiv preprint proposes Divergence Regularized Policy Optimization (DRPO), a method that replaces the hard trust-region mask used in DPPO with a smooth advantage-weighted quadratic regularizer on policy shift. The approach addresses a known weakness in PPO and GRPO where importance ratios poorly proxy distributional shift in long-tailed vocabularies, and in DPPO where gradient signals are discarded rather than corrected at trust-region boundaries. Experiments across model scales, architectures, and precision settings show improved stability and efficiency in LLM RL post-training.

Alignment and RLHF Divergence Regularized Policy Optimization GRPO PPO +1 more