GSPO (Group Sequence Policy Optimization)

technique·active·gspo-group-sequence-policy-optimization--a1fe0b54·1 event·first seen 1mo ago

Aliases: GSPO (Group Sequence Policy Optimization)

Recent events (1)

7·Qwen Research·1mo ago

GSPO: Group Sequence Policy Optimization for Scalable RL Training of Language Models

Qwen researchers introduce Group Sequence Policy Optimization (GSPO), a new RL algorithm designed to address the severe training instability and model collapse observed in existing methods such as GRPO during extended training runs. The core motivation is stable RL scaling, so that language models can turn additional training compute into better reasoning and problem-solving capabilities. The paper targets a known bottleneck in post-training pipelines, where instability prevents further performance gains.
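The summary above does not spell out the mechanism, so the following is only a minimal sketch of the idea as GSPO is publicly described: the importance ratio and clipping are applied once per sequence (length-normalized) rather than per token as in GRPO, with advantages normalized within a group of responses to the same prompt. The function names, the use of the population standard deviation, and the default `eps` are illustrative assumptions, not the authors' code or hyperparameters.

```python
import math
from statistics import mean, pstdev


def group_advantages(rewards):
    """Normalize rewards within one group of responses to the same prompt.

    Assumption: population standard deviation; a zero-variance group
    yields all-zero advantages instead of dividing by zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0.0:
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]


def sequence_ratio(new_logprobs, old_logprobs):
    """Length-normalized sequence-level importance ratio.

    Equals exp(mean over tokens of (new log-prob - old log-prob)),
    i.e. the sequence likelihood ratio raised to the power 1/length.
    """
    n = len(new_logprobs)
    log_ratio = sum(new - old for new, old in zip(new_logprobs, old_logprobs)) / n
    return math.exp(log_ratio)


def gspo_objective(new_lp, old_lp, rewards, eps=0.2):
    """Clipped surrogate objective averaged over one group (to be maximized).

    new_lp / old_lp: per-response lists of token log-probs under the
    current and old policy. eps is a placeholder clipping range.
    """
    advs = group_advantages(rewards)
    total = 0.0
    for nlp, olp, adv in zip(new_lp, old_lp, advs):
        s = sequence_ratio(nlp, olp)
        clipped = min(max(s, 1.0 - eps), 1.0 + eps)
        # One clipping decision per sequence, not per token.
        total += min(s * adv, clipped * adv)
    return total / len(rewards)
```

Because the ratio is computed once per response, a single outlier token cannot by itself trigger or escape clipping, which is the intuition usually given for why sequence-level weighting behaves more stably over long runs. In a real training loop the same computation would be done on batched tensors with gradients flowing through the new log-probabilities.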