Adaptive Data Scheduling (ADS) improves LLM reinforcement learning post-training by 5.2% over GRPO
Researchers propose Adaptive Data Scheduling (ADS), a dual-level framework that replaces uniform sampling in RL post-training with adaptive distribution over semantic clusters and policy-boundary sample selection. Evaluated across three LLMs and seven reasoning benchmarks, ADS improves average accuracy by 5.2% over GRPO and generalizes across RL objectives. The method addresses a structural limitation in standard RL post-training pipelines by accounting for semantic data structure and evolving policy capability during training.
Related guides (2)
Related events (8)
ACPO: Adaptive Clip Policy Optimization improves RLVR training for LLM reasoning
A new arXiv preprint provides theoretical analysis of Reinforcement Learning from Verifiable Rewards (RLVR) updates, identifying off-policy degree and gradient expectation as key factors governing update dynamics. The authors show that differences in gradient steps per rollout substantially affect importance sampling ratio distributions and which tokens dominate updates. Based on this analysis, they propose Adaptive Clip Policy Optimization (ACPO), which adjusts clipping boundaries per token group by empirical variance of importance sampling ratios, outperforming DAPO and CISPO baselines on 3B and 7B models across math, tabular QA, and logic benchmarks.
N-GRPO: Semantic Neighbor Mixing for Improved Policy Optimization in LLM Reasoning
A new arXiv preprint introduces N-GRPO, an exploration strategy for the GRPO reinforcement learning framework that improves solution diversity during rollout by mixing embeddings of anchor tokens with their nearest semantic neighbors rather than using token-level sampling or random noise. The method is evaluated on DeepSeek-R1-Distill-Qwen models of various sizes and shows consistent improvements on math reasoning benchmarks plus out-of-distribution generalization. The work targets a known limitation in RLHF-style training: redundant rollout trajectories that reduce effective learning signal.
STARE: Token-level advantage reweighting to prevent entropy collapse in GRPO-style RL training
Researchers introduce STARE, a method addressing policy entropy collapse in GRPO-style reinforcement learning from verifiable rewards (RLVR) for LLM post-training. Through first-order gradient analysis, they identify a token-level credit assignment mismatch and propose selectively reweighting advantages for entropy-critical tokens using batch-internal surprisal quantiles plus a closed-loop entropy gate. Evaluated across 1.5B–32B models on short/long chain-of-thought and multi-turn tool use tasks, STARE outperforms DAPO and other baselines by 4–8% on AIME24/25 while sustaining stable training over thousands of steps.
ADAS: Attention-Discounted Adaptive Sampler improves parallel decoding for masked diffusion language models
Researchers propose ADAS, a training-free reranking rule for masked diffusion language model decoding that addresses token interaction failures in parallel token commitment. The method greedily penalizes candidates that attend strongly to already-selected uncertain positions, using attention weights as soft marginal penalties rather than hard constraints. Evaluated on LLaDA-8B-Base and Dream-7B-Base across GSM8K, MATH500, HumanEval, and MBPP, ADAS improves low-NFE performance by 9–10 percentage points on average when plugged into existing samplers with only 3.1% runtime overhead.
ExpRL: RL-based mid-training using human QA data as reward scaffolds for LLM reasoning
ExpRL proposes an automated approach to LLM mid-training that replaces manually curated reasoning traces with large corpora of human-written QA data used as reward scaffolds rather than imitation targets. Reference solutions are hidden from the policy and used only to construct problem-specific grading rubrics, enabling dense process-level rewards that reinforce partial progress and intermediate reasoning steps. On challenging math reasoning benchmarks, ExpRL outperforms SFT, sparse-reward GRPO, and self-distillation as an RL initialization strategy, with additional mixed-domain experiments suggesting broader applicability.
LamPO: Lambda-Style Policy Optimization with Pairwise Decomposed Advantage for Reasoning LMs
LamPO proposes a new RLVR training objective that replaces GRPO's scalar group-relative advantages with a Pairwise Decomposed Advantage, aggregating pairwise reward gaps within response groups and weighting comparisons by confidence-aware log-probability differences. The method retains a critic-free, clipped-update PPO-style structure and optionally adds a ROUGE-L-based dense auxiliary reward to reduce sparsity. Experiments on AIME24, AIME25, MATH-500, and GPQA-Diamond using Qwen3-1.7B, Qwen3-4B, and Phi-4-mini show consistent improvements over GRPO and other RLVR variants with more stable training dynamics.
APPO: Fine-grained branching and credit assignment for agentic RL in LLMs
Researchers introduce Agentic Procedural Policy Optimization (APPO), a reinforcement learning method that shifts branching and credit assignment from coarse tool-call boundaries to fine-grained decision points within generated sequences. APPO uses a Branching Score combining token uncertainty with policy-induced likelihood gains to select exploration points, plus procedure-level advantage scaling for credit distribution. Evaluated on 13 benchmarks, APPO improves strong agentic RL baselines by nearly 4 points while maintaining efficient tool use and interpretability. The work addresses a known weakness in multi-turn agentic RL: that influential decisions are distributed throughout sequences, not concentrated at tool-call boundaries.
SAERL: Using Sparse Autoencoders to Guide LLM Reinforcement Learning Data Engineering
SAERL is a post-training data engineering framework that uses Sparse Autoencoders (SAEs) — a mechanistic interpretability tool — to extract intrinsic model signals for controlling data diversity, difficulty, and quality during RL fine-tuning. The framework applies SAE-space clustering for batch diversity, a difficulty proxy for curriculum ordering, and a quality probe for data filtering. On Qwen2.5-Math-1.5B with GRPO, SAERL achieves 3% average accuracy improvement and reaches target accuracy with 20% fewer training steps. SAE representations transfer across model families and scales, suggesting broad applicability as a lightweight data engineering tool.

