technique

CISPO

technique · active · provisional · cispo-dfc10ca2 · 1 event · first seen 43h ago

Aliases: CISPO

Co-occurring entities

DAPO · Reinforcement Learning with Verifiable Rewards · Adaptive Clip Policy Optimization

More like this (12)

CAISI · CISPA Helmholtz Center for Information Security · IS-CoT · SPPO · CAGE · CVSS-C · SimPO · CoRP · ConSA · CATT · CSEE · CICIDS

Recent events (1)

5 · arXiv · cs.CL · 43h ago

ACPO: Adaptive Clip Policy Optimization improves RLVR training for LLM reasoning

A new arXiv preprint provides a theoretical analysis of Reinforcement Learning with Verifiable Rewards (RLVR) updates and identifies the off-policy degree and the gradient expectation as the key factors governing update dynamics. The authors show that the number of gradient steps taken per rollout substantially affects the distribution of importance sampling ratios and determines which tokens dominate the update. Based on this analysis, they propose Adaptive Clip Policy Optimization (ACPO), which adjusts the clipping boundaries for each token group according to the empirical variance of importance sampling ratios. ACPO outperforms the DAPO and CISPO baselines on 3B and 7B models across math, tabular QA, and logic benchmarks.
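To make the idea of variance-dependent clipping concrete, here is a minimal sketch in plain Python. It is not the preprint's algorithm: the summary above does not give ACPO's actual rule, so the function names, the parameters (`base_eps`, `scale`, `min_eps`, `max_eps`), and the choice to tighten the clip range as variance grows are all assumptions made for illustration. The surrogate itself is the standard PPO-style clipped objective.

```python
import math
from statistics import pvariance


def importance_ratios(new_logps, old_logps):
    """Per-token importance sampling ratios pi_new / pi_old from log-probabilities."""
    return [math.exp(n - o) for n, o in zip(new_logps, old_logps)]


def adaptive_clip_bounds(ratios, base_eps=0.2, scale=1.0, min_eps=0.05, max_eps=0.5):
    """Hypothetical rule (an assumption, not taken from the preprint):
    shrink the clip half-width as the empirical variance of the ratios grows."""
    var = pvariance(ratios) if len(ratios) > 1 else 0.0
    eps = base_eps / (1.0 + scale * var)
    eps = min(max(eps, min_eps), max_eps)  # keep the half-width in a sane range
    return 1.0 - eps, 1.0 + eps


def clipped_surrogate(ratios, advantages, low, high):
    """Standard PPO-style clipped surrogate, averaged over tokens."""
    total = 0.0
    for r, a in zip(ratios, advantages):
        clipped = min(max(r, low), high)
        total += min(r * a, clipped * a)
    return total / len(ratios)


def grouped_objective(groups):
    """Average the clipped surrogate over token groups, each with its own bounds.

    `groups` is a list of (new_logps, old_logps, advantages) tuples; how tokens
    are assigned to groups is left open here (hypothetical).
    """
    values = []
    for new_logps, old_logps, advantages in groups:
        if not advantages:
            continue  # skip empty groups
        ratios = importance_ratios(new_logps, old_logps)
        low, high = adaptive_clip_bounds(ratios)
        values.append(clipped_surrogate(ratios, advantages, low, high))
    return sum(values) / len(values) if values else 0.0
```

In this sketch a group whose ratios are all close to 1 keeps the default bounds of roughly 0.8 and 1.2, while a group with widely spread ratios gets a narrower range. Whether ACPO widens or narrows the range under high variance is not stated in the summary, so the direction used here should be treated as a placeholder.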

Evaluation and Benchmarking · Alignment and RLHF · DAPO · CISPO · Reinforcement Learning with Verifiable Rewards · +1 more