CISPO
technique · active · provisional
cispo-dfc10ca2 · 1 event · first seen 43h ago
Aliases: CISPO
Recent events (1)
ACPO: Adaptive Clip Policy Optimization improves RLVR training for LLM reasoning
A new arXiv preprint provides a theoretical analysis of Reinforcement Learning from Verifiable Rewards (RLVR) updates, identifying the degree of off-policyness and the gradient expectation as the key factors governing update dynamics. The authors show that the number of gradient steps taken per rollout substantially affects the distribution of importance sampling ratios and which tokens dominate the updates. Based on this analysis, they propose Adaptive Clip Policy Optimization (ACPO), which sets the clipping boundaries for each token group according to the empirical variance of that group's importance sampling ratios. The authors report that ACPO outperforms the DAPO and CISPO baselines on 3B and 7B models across math, tabular QA, and logic benchmarks.
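The summary does not give ACPO's actual clipping rule, so the following is only a rough sketch of what "per-group clipping boundaries driven by the empirical variance of importance sampling ratios" could look like. Every name here (`adaptive_clip_bounds`, `clip_ratios`, `base_eps`, `ref_var`) is hypothetical. The shrink rule `eps = base_eps / (1 + var / ref_var)` and the choice to tighten the bounds as variance grows are assumptions made for illustration, not the paper's formula.

```python
from statistics import pvariance


def adaptive_clip_bounds(ratios, group_ids, base_eps=0.2, ref_var=0.01):
    """Return one (lower, upper) clip interval per token.

    ASSUMPTION (not from the preprint): a group's clip half-width shrinks as
    the empirical variance of its importance sampling ratios grows:
        eps_g = base_eps / (1 + var_g / ref_var)
    """
    # Collect the importance sampling ratios belonging to each token group.
    by_group = {}
    for ratio, group in zip(ratios, group_ids):
        by_group.setdefault(group, []).append(ratio)

    # Population variance of the ratios within each group sets that group's eps.
    eps_by_group = {
        group: base_eps / (1.0 + pvariance(values) / ref_var)
        for group, values in by_group.items()
    }
    return [(1.0 - eps_by_group[g], 1.0 + eps_by_group[g]) for g in group_ids]


def clip_ratios(ratios, group_ids, **kwargs):
    """Clip each ratio into the interval assigned to its token group."""
    bounds = adaptive_clip_bounds(ratios, group_ids, **kwargs)
    return [min(max(r, lo), hi) for r, (lo, hi) in zip(ratios, bounds)]


# Toy data: group 0 is nearly on-policy, group 1 has widely spread ratios.
ratios = [0.98, 1.02, 1.00, 0.50, 1.60, 1.00]
group_ids = [0, 0, 0, 1, 1, 1]
clipped = clip_ratios(ratios, group_ids)
```

On this toy input the low-variance group keeps an interval close to the fixed `1 ± base_eps`, so its ratios pass through unchanged, while the high-variance group is clipped to a much narrower band around 1. A fixed-boundary scheme would apply the same interval to both groups.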