technique
DAPO
technique · active · provisional
dapo-4506c9eb · 1 event · first seen 2d ago
Aliases: DAPO
Recent events (1)
STARE: Token-level advantage reweighting to prevent entropy collapse in GRPO-style RL training
Researchers introduce STARE, a method that addresses policy entropy collapse in GRPO-style reinforcement learning from verifiable rewards (RLVR) for LLM post-training. Through first-order gradient analysis, they identify a token-level credit assignment mismatch and propose selectively reweighting advantages for entropy-critical tokens, using batch-internal surprisal quantiles plus a closed-loop entropy gate. Evaluated across 1.5B–32B models on short and long chain-of-thought and multi-turn tool-use tasks, STARE outperforms DAPO and other baselines by 4–8% on AIME24/25 while sustaining stable training over thousands of steps.
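The summary does not give STARE's actual formulas, so the following is only a rough sketch of what quantile-based advantage reweighting with an entropy gate could look like. Every name and default here (`reweight_advantages`, `target_entropy`, `q`, `boost`) is hypothetical. The direction of the reweighting (boosting high-surprisal tokens) and the simple on/off gate are assumptions made for illustration, not the authors' method.

```python
import math


def surprisal_quantile(values, q):
    """Linearly interpolated q-quantile of a non-empty list, with 0 <= q <= 1."""
    s = sorted(values)
    pos = q * (len(s) - 1)
    lo, hi = math.floor(pos), math.ceil(pos)
    frac = pos - lo
    return s[lo] * (1 - frac) + s[hi] * frac


def reweight_advantages(logprobs, advantages, mean_entropy,
                        target_entropy=0.5, q=0.8, boost=1.5):
    """Hypothetical token-level advantage reweighting.

    logprobs:     log-probabilities of the sampled tokens in the batch
    advantages:   per-token advantages (e.g. a sequence-level advantage
                  broadcast to every token, as in GRPO-style training)
    mean_entropy: current measured policy entropy, used as feedback
    """
    # Assumed gate: only intervene while entropy is below the target.
    if mean_entropy >= target_entropy:
        return list(advantages)

    # Surprisal of a sampled token is its negative log-probability.
    surprisals = [-lp for lp in logprobs]

    # The threshold is computed from the batch itself, not a fixed constant.
    threshold = surprisal_quantile(surprisals, q)

    # Assumed rule: scale advantages only for tokens above the quantile.
    return [a * boost if s > threshold else a
            for a, s in zip(advantages, surprisals)]


if __name__ == "__main__":
    lp = [-0.1, -0.2, -3.0, -0.05, -2.5]
    adv = [1.0] * 5
    print(reweight_advantages(lp, adv, mean_entropy=0.2))
    print(reweight_advantages(lp, adv, mean_entropy=0.9))
```

In this sketch the gate makes the intervention self-limiting: once measured entropy recovers above `target_entropy`, advantages pass through unchanged. A real implementation would operate on tensors and might use a smoother gate than this hard threshold.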