Almanac
← Events
5arXiv cs.CL (Computation and Language)·43h ago

ACPO: Adaptive Clip Policy Optimization improves RLVR training for LLM reasoning

A new arXiv preprint provides theoretical analysis of Reinforcement Learning from Verifiable Rewards (RLVR) updates, identifying off-policy degree and gradient expectation as key factors governing update dynamics. The authors show that differences in gradient steps per rollout substantially affect importance sampling ratio distributions and which tokens dominate updates. Based on this analysis, they propose Adaptive Clip Policy Optimization (ACPO), which adjusts clipping boundaries per token group by empirical variance of importance sampling ratios, outperforming DAPO and CISPO baselines on 3B and 7B models across math, tabular QA, and logic benchmarks.

Related guides (3)

Related events (8)

5arXiv · cs.CL·1mo ago·source ↗

LamPO: Lambda-Style Policy Optimization with Pairwise Decomposed Advantage for Reasoning LMs

LamPO proposes a new RLVR training objective that replaces GRPO's scalar group-relative advantages with a Pairwise Decomposed Advantage, aggregating pairwise reward gaps within response groups and weighting comparisons by confidence-aware log-probability differences. The method retains a critic-free, clipped-update PPO-style structure and optionally adds a ROUGE-L-based dense auxiliary reward to reduce sparsity. Experiments on AIME24, AIME25, MATH-500, and GPQA-Diamond using Qwen3-1.7B, Qwen3-4B, and Phi-4-mini show consistent improvements over GRPO and other RLVR variants with more stable training dynamics.

6arXiv · cs.CL·6d ago·source ↗

STARE: Token-level advantage reweighting to prevent entropy collapse in GRPO-style RL training

Researchers introduce STARE, a method addressing policy entropy collapse in GRPO-style reinforcement learning from verifiable rewards (RLVR) for LLM post-training. Through first-order gradient analysis, they identify a token-level credit assignment mismatch and propose selectively reweighting advantages for entropy-critical tokens using batch-internal surprisal quantiles plus a closed-loop entropy gate. Evaluated across 1.5B–32B models on short/long chain-of-thought and multi-turn tool use tasks, STARE outperforms DAPO and other baselines by 4–8% on AIME24/25 while sustaining stable training over thousands of steps.

5arXiv · cs.LG·15d ago·source ↗

DRPO: Smooth divergence regularization replaces hard masking in LLM RL training

A new arXiv preprint proposes Divergence Regularized Policy Optimization (DRPO), a method that replaces the hard trust-region mask used in DPPO with a smooth advantage-weighted quadratic regularizer on policy shift. The approach addresses a known weakness in PPO and GRPO where importance ratios poorly proxy distributional shift in long-tailed vocabularies, and in DPPO where gradient signals are discarded rather than corrected at trust-region boundaries. Experiments across model scales, architectures, and precision settings show improved stability and efficiency in LLM RL post-training.

6arXiv · cs.AI·1mo ago·source ↗

POW3R: Policy-Aware Rubric Rewards for More Efficient RLVR Training

This paper identifies a failure mode in rubric-based reinforcement learning with verifiable rewards (RLVR): static aggregation of criterion weights conflates human-assigned importance with current optimization utility, causing many criteria to be either already saturated or unreachable. The authors introduce POW3R, a framework that dynamically reweights criterion-level rewards during training using rollout-level contrast to emphasize criteria that currently differentiate policy outputs. Across three base policies and two datasets (multimodal and text-only), POW3R wins 24 of 30 comparisons on rubric reward and strict completion metrics, and reaches equivalent performance in 2.5–4× fewer training steps than vanilla GRPO with rubric rewards.

7arXiv · cs.AI·1mo ago·source ↗

Vector Policy Optimization: Training for Diversity Improves Test-Time Search

Vector Policy Optimization (VPO) is a new RL post-training algorithm for LLMs that replaces the scalar reward paradigm with vector-valued rewards, explicitly training models to produce diverse solution sets that specialize across different reward trade-offs. VPO is designed as a near-drop-in replacement for the GRPO advantage estimator and targets inference-scaling search procedures like AlphaEvolve. Across four tasks, VPO matches or outperforms scalar RL baselines on pass@k and best@k metrics, with advantages growing as search budget increases, and unlocks evolutionary search problems that GRPO-trained models cannot solve. The paper argues that diversity-optimized post-training may need to become the default as inference-time search becomes standard.

6arXiv · cs.LG·13d ago·source ↗

APPO: Fine-grained branching and credit assignment for agentic RL in LLMs

Researchers introduce Agentic Procedural Policy Optimization (APPO), a reinforcement learning method that shifts branching and credit assignment from coarse tool-call boundaries to fine-grained decision points within generated sequences. APPO uses a Branching Score combining token uncertainty with policy-induced likelihood gains to select exploration points, plus procedure-level advantage scaling for credit distribution. Evaluated on 13 benchmarks, APPO improves strong agentic RL baselines by nearly 4 points while maintaining efficient tool use and interpretability. The work addresses a known weakness in multi-turn agentic RL: that influential decisions are distributed throughout sequences, not concentrated at tool-call boundaries.

6arXiv · cs.LG·8d ago·source ↗

ExpRL: RL-based mid-training using human QA data as reward scaffolds for LLM reasoning

ExpRL proposes an automated approach to LLM mid-training that replaces manually curated reasoning traces with large corpora of human-written QA data used as reward scaffolds rather than imitation targets. Reference solutions are hidden from the policy and used only to construct problem-specific grading rubrics, enabling dense process-level rewards that reinforce partial progress and intermediate reasoning steps. On challenging math reasoning benchmarks, ExpRL outperforms SFT, sparse-reward GRPO, and self-distillation as an RL initialization strategy, with additional mixed-domain experiments suggesting broader applicability.

5arXiv · cs.CL·43h ago·source ↗

Adaptive Data Scheduling (ADS) improves LLM reinforcement learning post-training by 5.2% over GRPO

Researchers propose Adaptive Data Scheduling (ADS), a dual-level framework that replaces uniform sampling in RL post-training with adaptive distribution over semantic clusters and policy-boundary sample selection. Evaluated across three LLMs and seven reasoning benchmarks, ADS improves average accuracy by 5.2% over GRPO and generalizes across RL objectives. The method addresses a structural limitation in standard RL post-training pipelines by accounting for semantic data structure and evolving policy capability during training.