technique
Adaptive Clip Policy Optimization
technique · active · provisional
adaptive-clip-policy-optimization-5b17811c · 1 event · first seen 42h ago
Aliases: Adaptive Clip Policy Optimization
Co-occurring entities
More like this (12)
CLIP
Vector Policy Optimization
Preference Coordinated Multi-agent Policy Optimization
Hierarchical Relative Policy Optimization
APPO: Agentic Procedural Policy Optimization
Proximal Policy Optimization
Layer-Adaptive Expert Pruning
unCLIP
Pareto Optimal Policy Optimization
CLIPSeg
Observe-and-Act Adaptive Context Selection
Adaptive Data Scheduling
Recent events (1)
ACPO: Adaptive Clip Policy Optimization improves RLVR training for LLM reasoning
A new arXiv preprint presents a theoretical analysis of Reinforcement Learning from Verifiable Rewards (RLVR) updates and identifies the off-policy degree and the gradient expectation as the key factors governing update dynamics. The authors show that the number of gradient steps taken per rollout substantially affects the distribution of importance sampling ratios and which tokens dominate the updates. Based on this analysis, they propose Adaptive Clip Policy Optimization (ACPO), which adjusts the clipping boundaries per token group according to the empirical variance of the importance sampling ratios. The preprint reports that ACPO outperforms the DAPO and CISPO baselines on 3B and 7B models across math, tabular QA, and logic benchmarks.
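
To make the idea of variance-dependent clipping concrete, here is a minimal NumPy sketch. It is not the preprint's algorithm: the summary above does not give ACPO's actual rule, so the function name `adaptive_clip_surrogate`, the parameters `base_eps` and `sensitivity`, the mapping `eps = base_eps / (1 + sensitivity * var)` (including the choice to tighten rather than widen the range as variance grows), and the way tokens are assigned to groups are all assumptions chosen for illustration. The sketch only shows how a PPO-style clipped surrogate could use a clip range computed per token group from the empirical variance of that group's importance sampling ratios.

```python
import numpy as np


def adaptive_clip_surrogate(logp_new, logp_old, advantages, group_ids,
                            base_eps=0.2, sensitivity=1.0):
    """Hypothetical clipped surrogate with a per-group clip range.

    All names and the variance-to-epsilon rule are illustrative assumptions,
    not the rule used by ACPO.
    """
    logp_new = np.asarray(logp_new, dtype=float)
    logp_old = np.asarray(logp_old, dtype=float)
    advantages = np.asarray(advantages, dtype=float)
    group_ids = np.asarray(group_ids)

    # Per-token importance sampling ratio pi_new / pi_old.
    ratios = np.exp(logp_new - logp_old)

    # One clip half-width per token, shared within each token group.
    eps = np.empty_like(ratios)
    for g in np.unique(group_ids):
        mask = group_ids == g
        var = ratios[mask].var()  # empirical variance of the group's ratios
        # Assumed rule: higher variance gives a tighter clip range.
        eps[mask] = base_eps / (1.0 + sensitivity * var)

    # Standard PPO-style pessimistic (min) clipped objective, per token.
    clipped = np.clip(ratios, 1.0 - eps, 1.0 + eps)
    per_token = np.minimum(ratios * advantages, clipped * advantages)
    return per_token.mean(), eps
```

In this sketch, calling the function with identical old and new log-probabilities gives every group zero ratio variance, so every token receives the unmodified `base_eps` and the objective reduces to the mean advantage. A real implementation would compute the same quantities on framework tensors so that gradients flow through `logp_new`.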