technique
GSPO (Group Sequence Policy Optimization)
technique · active
gspo-group-sequence-policy-optimization--a1fe0b54 · 1 event · first seen 1mo ago
Aliases: GSPO (Group Sequence Policy Optimization)
Recent events (1)
GSPO: Group Sequence Policy Optimization for Scalable RL Training of Language Models
Qwen researchers introduce Group Sequence Policy Optimization (GSPO), a reinforcement learning (RL) algorithm designed to address the severe training instability and model collapse observed in existing methods such as GRPO during extended training runs. The motivation is to make RL training stable enough to scale, so that additional compute translates into better reasoning and problem-solving in language models. The paper targets a known bottleneck in post-training pipelines, where instability prevents further performance gains.