Almanac
Concept guide · In-depth

GRPO: Group Relative Policy Optimization for LLM Post-Training

TL;DR: GRPO is a reinforcement learning algorithm for post-training language models that sidesteps the need for a separate critic network by scoring each candidate response relative to others in the same sampled group. It became a widely adopted baseline for reasoning and alignment fine-tuning, but active research is now exposing its limits: training instability at scale, high-variance credit assignment across long chains of thought, and a surprising vulnerability to alignment-breaking via a single biased example.

Key takeaways

  • GRPO eliminates the value-network overhead of PPO by computing advantages within each sampled group, making it cheaper to run but introducing its own instability at extended training horizons.
  • Qwen researchers introduced GSPO specifically to address severe training instability and model collapse observed in GRPO during long RL runs.
  • A single biased GRPO training example is sufficient to induce systematic, generalizing stereotype-driven bias in an aligned LLM — a critical alignment vulnerability.
  • ADS (Adaptive Data Scheduling) improves average accuracy by 5.2% over GRPO by replacing its uniform sampling with semantically-aware, policy-boundary-aware data selection.
  • RREDCoT targets GRPO's delayed-reward problem in chain-of-thought training by redistributing rewards across reasoning segments without expensive Monte Carlo sampling.
  • GLM-5.2 explicitly switched from GRPO to PPO for long-horizon RL training, citing GRPO's limitations for agentic tasks.

What it is

Group Relative Policy Optimization (GRPO) is a reinforcement learning algorithm for post-training language models. Its defining characteristic is how it estimates the advantage of a given response — the signal that tells the policy whether to make that response more or less likely. Rather than training a separate critic (value) network to predict expected reward, GRPO samples a group of candidate responses for each prompt and scores each one relative to the group's mean reward. The group itself is the baseline. This makes GRPO substantially cheaper to run than PPO, which requires maintaining and updating a full critic model in parallel with the policy.

How it works

The training loop proceeds as follows: for each prompt in the batch, the current policy generates G candidate responses. Each response is scored by a reward model (or a rule-based verifier for tasks with ground-truth answers, such as math). The advantage for response i is computed as its reward minus the group mean, normalized by the group standard deviation. The policy is then updated to increase the probability of high-advantage responses and decrease the probability of low-advantage ones. Two mechanisms limit the size of the update: a PPO-style clipped probability ratio keeps each step close to the policy that generated the samples, and a KL-divergence penalty keeps the policy from drifting too far from a fixed reference model.

```
For each prompt x:
    Sample G responses {y_1 ... y_G} from policy π_θ
    Score each:  r_i = R(x, y_i)
    Advantage:   A_i = (r_i − mean(r)) / std(r)
    Update θ to maximize E[A_i · log π_θ(y_i | x)],
        with clipped updates and a KL penalty toward the reference model
```
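
The advantage step can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function name `group_relative_advantages` and the small `eps` term (added so that a group with identical rewards does not divide by zero) are assumptions introduced here, as is the choice of population standard deviation.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Return (r_i - mean(r)) / (std(r) + eps) for one prompt's group of rewards.

    `eps` is an illustrative guard against a zero standard deviation.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group (an assumption)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled responses to one prompt; 1.0 means the verifier accepted the answer.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# A group where every response earns the same reward carries no learning signal.
flat_group = group_relative_advantages([1.0, 1.0, 1.0])
```

In the mixed group the accepted responses receive positive advantages and the rejected ones negative advantages of equal size. In the flat group every advantage is zero, so that prompt contributes nothing to the update.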

The absence of a critic is the key engineering tradeoff: it reduces memory and compute overhead but removes the smoothing effect a learned value function provides, which matters most during long training runs.
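
For readers who prefer the update in symbols, one commonly cited way to write the GRPO objective is sketched below. The symbols ε (clipping range), β (KL penalty weight), π_old (the policy that generated the samples), and π_ref (the reference model) are not named elsewhere in this guide and are introduced here as assumptions; implementations differ in details such as length normalization and how the KL term is estimated.

```latex
\mathcal{J}(\theta) = \mathbb{E}_{x,\ \{y_i\}_{i=1}^{G} \sim \pi_{\theta_{\text{old}}}}
\left[ \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|y_i|} \sum_{t=1}^{|y_i|}
\Big( \min\big( \rho_{i,t} A_i,\ \operatorname{clip}(\rho_{i,t},\, 1-\epsilon,\, 1+\epsilon)\, A_i \big)
- \beta\, D_{\mathrm{KL}}\big[\pi_\theta \,\|\, \pi_{\text{ref}}\big] \Big) \right],
\qquad
\rho_{i,t} = \frac{\pi_\theta(y_{i,t} \mid x, y_{i,<t})}{\pi_{\theta_{\text{old}}}(y_{i,t} \mid x, y_{i,<t})},
\qquad
A_i = \frac{r_i - \operatorname{mean}(r)}{\operatorname{std}(r)}
```

No learned value function appears anywhere in the expression: the only baseline is the group mean inside A_i, which is the tradeoff described above.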

Why it matters

GRPO became a widely adopted baseline for reasoning model post-training because it is simple to implement, computationally lean, and effective on tasks with verifiable rewards — math, code, structured output — where a rule-based reward signal is available and responses are short enough that delayed reward is not a severe problem. Its adoption across reasoning fine-tuning pipelines made it the de facto reference point against which newer RL methods are evaluated.

Variants and active research

Recent work forms a dense cluster that either extends GRPO or addresses its failure modes:

N-GRPO targets redundant rollout trajectories — a known inefficiency where sampled responses are too similar to provide diverse learning signal. It improves exploration by mixing anchor token embeddings with nearest semantic neighbors during rollout, showing consistent gains on math reasoning benchmarks.

RREDCoT addresses the delayed-reward problem in chain-of-thought training. GRPO assigns a single reward at the end of a full response; for long reasoning traces this creates high-variance gradient estimates. RREDCoT redistributes reward across segments of the reasoning chain using the model itself to approximate optimal credit assignment, avoiding the computational cost of Monte Carlo sampling.

CORA identifies a distinct failure mode in multimodal RLVR: a systematic inconsistency between the reasoning trace and the final answer that persists throughout GRPO training. It introduces a consistency reward model and a Hybrid Reward Advantage Splitting mechanism to coordinate task and consistency optimization.

ADS (Adaptive Data Scheduling) attacks a structural limitation in GRPO's training data pipeline rather than the algorithm itself. Standard GRPO uses uniform sampling over the training set; ADS replaces this with adaptive distribution over semantic clusters and policy-boundary sample selection, achieving a 5.2% average accuracy improvement over GRPO across three LLMs and seven reasoning benchmarks.

GSPO (Group Sequence Policy Optimization), introduced by Qwen researchers, is the most direct algorithmic successor. It was motivated by severe training instability and model collapse observed in GRPO during extended training runs — a bottleneck that prevents further performance gains when scaling RL compute. GSPO is designed to enable stable RL scaling where GRPO breaks down.
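
This guide does not spell out how GSPO differs mechanically. As the name suggests, and as the GSPO paper is generally summarized, the importance ratio is computed once per sequence (length-normalized) rather than once per token. The sketch below contrasts the two under that assumption; the function names and the log-probability values are hypothetical.

```python
import math


def token_level_ratios(new_logps, old_logps):
    """One importance ratio per token, as in token-level clipping."""
    return [math.exp(n - o) for n, o in zip(new_logps, old_logps)]


def sequence_level_ratio(new_logps, old_logps):
    """Length-normalized sequence ratio: exp of the mean per-token log-ratio."""
    diffs = [n - o for n, o in zip(new_logps, old_logps)]
    return math.exp(sum(diffs) / len(diffs))


# Hypothetical per-token log-probabilities for one three-token response.
old_logps = [-1.0, -2.0, -0.5]
new_logps = [-0.9, -2.6, -0.5]

per_token = token_level_ratios(new_logps, old_logps)
per_sequence = sequence_level_ratio(new_logps, old_logps)
```

The sequence-level value is the geometric mean of the token-level ratios, so a single token with an extreme ratio moves it less than it would move that token's own clipping decision.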

Known limitations and pitfalls

Training instability at scale. The most practically significant limitation: GRPO's stability degrades during long RL runs, leading to model collapse. This is why GLM-5.2 — a 753B MoE model optimized for long-horizon agentic coding — explicitly switched from GRPO to PPO for its RL training stage, citing GRPO's unsuitability for extended agentic horizons.

Alignment fragility. Research demonstrates that a single biased training example under GRPO is sufficient to induce systematic, generalizing stereotype-driven bias in an aligned LLM, overriding safety guardrails. The attack generalizes across attributes, categories, and benchmarks, and susceptibility scales with the model's initial likelihood of producing biased outputs. This is a critical finding for any deployment pipeline that allows fine-tuning on user-provided data.

Delayed reward in long reasoning traces. Assigning a single terminal reward to a multi-step chain of thought creates high-variance gradient signals. This is a structural mismatch between GRPO's design (short, independently scorable responses) and the demands of extended reasoning tasks.
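
A small sketch makes the credit-assignment problem concrete. With a single terminal reward, every token in a response inherits the same sequence-level advantage, whether or not that token contributed to the outcome. The helper name and the six-step trace below are hypothetical.

```python
def broadcast_terminal_advantage(advantage, num_tokens):
    """Give every token of a response the same sequence-level advantage."""
    return [advantage] * num_tokens


# Hypothetical trace: five sound reasoning steps followed by one wrong answer.
steps = ["step1", "step2", "step3", "step4", "step5", "wrong_answer"]
per_token = broadcast_terminal_advantage(-1.0, len(steps))
```

All six positions are penalized equally, including the five sound steps. Segment-level methods such as RREDCoT aim to replace this flat list with one that varies along the trace.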

Uniform data sampling. GRPO's standard training loop does not account for the semantic structure of the training distribution or the evolving capability of the policy, which ADS addresses with measurable gains.

When to use GRPO — and when not to

GRPO is a strong default for short-to-medium horizon tasks with verifiable rewards: math, structured code generation, constrained output formats. It is cheap, well-understood, and well-supported in open tooling.

Prefer PPO when training long-horizon agentic policies where stability over many gradient steps is required. Prefer GSPO when extended RL scaling is the goal and GRPO instability has become the binding constraint. Prefer DPO when online rollouts are too expensive and preference pairs are available. In any pipeline that accepts external fine-tuning data, treat GRPO's alignment fragility as a first-class security concern — the one-shot attack surface is real and documented.
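
The rules of thumb above can be written down as a small helper. This is only a sketch of this guide's recommendations: the function name, the flag names, and the order in which the checks are applied are assumptions, not an established selection procedure.

```python
def suggest_method(long_horizon_agentic, grpo_unstable_at_scale,
                   online_rollouts_affordable, has_preference_pairs):
    """Map the guide's rules of thumb to a suggested starting point."""
    # Offline preference data and no rollout budget: DPO.
    if not online_rollouts_affordable and has_preference_pairs:
        return "DPO"
    # Long-horizon agentic training: PPO.
    if long_horizon_agentic:
        return "PPO"
    # Extended RL scaling where GRPO instability is the binding constraint: GSPO.
    if grpo_unstable_at_scale:
        return "GSPO"
    # Short-to-medium horizon tasks with verifiable rewards: GRPO.
    return "GRPO"


default_choice = suggest_method(
    long_horizon_agentic=False,
    grpo_unstable_at_scale=False,
    online_rollouts_affordable=True,
    has_preference_pairs=False,
)
```

The helper says nothing about the alignment-fragility concern, which applies to any pipeline that accepts external fine-tuning data regardless of the method chosen.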

[Figure: GRPO training loop and its known failure modes]

GRPO vs. key RL post-training alternatives

| Method | Critic network? | Key advantage | Known limitation | Notable usage |
|---|---|---|---|---|
| GRPO | No | Low overhead; simple group-relative scoring | Instability at scale; alignment fragility; delayed-reward variance in CoT | Widespread reasoning fine-tuning baseline |
| PPO | Yes | Stable long-horizon training | Higher compute cost (critic overhead) | GLM-5.2 long-horizon agentic RL |
| GSPO | No | Designed for stable extended RL runs | Newer; less field-tested | Qwen research pipeline |
| DPO | No | No online rollouts needed | Requires preference pairs; no online exploration | Alignment fine-tuning from preference data |


Timeline

  1. GSPO introduced to address GRPO instability and model collapse at scale

  2. RREDCoT proposes segment-level reward redistribution to fix GRPO's delayed-reward variance in CoT

  3. One-shot GRPO alignment break demonstrated: single biased example overrides safety guardrails

  4. CORA identifies thinking-answer inconsistency persisting throughout GRPO training in multimodal RLVR

  5. ADS achieves 5.2% accuracy gain over GRPO via adaptive semantic data scheduling

  6. GLM-5.2 switches from GRPO to PPO for long-horizon agentic RL training

Related topics

  • GSPO (Group Sequence Policy Optimization)
  • Chain-of-Thought Reasoning
  • DeepSeek V4

FAQ

Why does GRPO not need a critic network?

GRPO computes the advantage of each response by comparing its reward to the mean reward of the other responses sampled for the same prompt — the group itself acts as the baseline, making a learned value function unnecessary.

What is GRPO's main failure mode at scale?

Qwen researchers observed severe training instability and model collapse during extended GRPO runs, which motivated the development of GSPO as a more stable alternative for long RL training horizons.

How fragile is GRPO-trained alignment?

Research shows a single biased training example under GRPO is sufficient to induce systematic, generalizing stereotype-driven bias that overrides safety guardrails — susceptibility scales with the model's initial likelihood of producing biased outputs.

Why does GRPO struggle with chain-of-thought reasoning?

GRPO assigns reward at the end of a full response, creating high-variance delayed-reward signals across long reasoning traces; methods like RREDCoT address this by redistributing reward across intermediate reasoning segments.

Is GRPO still the right choice for agentic tasks?

Evidence from GLM-5.2's development suggests PPO is preferable for long-horizon agentic RL, where GRPO's instability becomes a practical bottleneck; GRPO remains a strong baseline for shorter-horizon reasoning tasks.

More on GRPO (Group Relative Policy Optimization) (6)

7 · arXiv · cs.CL · 21d ago

One-shot GRPO training on a single biased example can break LLM alignment

A new arXiv paper demonstrates that a single biased training example using Group Relative Policy Optimization (GRPO) is sufficient to induce systematic bias in aligned LLMs, with stereotype-driven reasoning generalizing across attributes, categories, and benchmarks. The authors find that model susceptibility varies based on the initial likelihood of producing biased outputs. The result exposes a critical vulnerability in post-training alignment: a minimal fine-tuning intervention can override safety guardrails.

4 · arXiv · cs.CL · 8d ago

P4IR framework uses SFT + GRPO to improve LLM-based automated building code compliance

Researchers introduce P4IR, a two-stage framework combining supervised fine-tuning (SFT) and Group Relative Policy Optimization (GRPO) to improve LLM accuracy in automated code compliance (ACC) for building regulations. The approach reduces tree edit distance and token-level Levenshtein distance by up to 23.8% and 38.6% respectively versus SFT baselines, and outperforms Claude Opus/Sonnet 4.5, GPT-5.2, Qwen-3-Max, and GLM-4.7 in zero-shot settings. The work targets a narrow but practically important domain where LLM hallucinations carry real regulatory consequences.

7 · Qwen Research · 1mo ago

GSPO: Group Sequence Policy Optimization for Scalable RL Training of Language Models

Qwen researchers introduce Group Sequence Policy Optimization (GSPO), a new RL algorithm designed to address severe training instability and model collapse observed in existing methods like GRPO during extended training runs. The core motivation is enabling stable RL scaling for language models to improve reasoning and problem-solving capabilities with increased compute. The paper targets a known bottleneck in post-training pipelines where instability prevents further performance gains.

4 · arXiv · cs.CL · 21d ago

N-GRPO: Semantic Neighbor Mixing for Improved Policy Optimization in LLM Reasoning

A new arXiv preprint introduces N-GRPO, an exploration strategy for the GRPO reinforcement learning framework that improves solution diversity during rollout by mixing embeddings of anchor tokens with their nearest semantic neighbors rather than using token-level sampling or random noise. The method is evaluated on DeepSeek-R1-Distill-Qwen models of various sizes and shows consistent improvements on math reasoning benchmarks plus out-of-distribution generalization. The work targets a known limitation in RLHF-style training: redundant rollout trajectories that reduce effective learning signal.

5 · arXiv · cs.CL · 15d ago

CORA: Consistency-Oriented Reasoning Alignment addresses thinking-answer gap in multimodal RLVR

Researchers identify and analyze a systematic inconsistency between reasoning traces and final answers in RLVR-trained large vision-language models, showing the problem persists throughout GRPO training and inference. They propose CORA, which introduces a lightweight plug-and-play consistency reward model and a Hybrid Reward Advantage Splitting (HRAS) mechanism to coordinate task and consistency optimization. Experiments across multimodal reasoning benchmarks show CORA improves both task performance and reasoning faithfulness.

5 · arXiv · cs.LG · 25d ago

RREDCoT: Segment-level reward redistribution for chain-of-thought reasoning via self-approximated credit assignment

RREDCoT is a new method for redistributing rewards across segments of Chain-of-Thought traces during RL fine-tuning of reasoning language models, addressing the high-variance delayed-reward problem inherent in GRPO-style training. Rather than using computationally expensive Monte Carlo sampling for intermediate state value estimation, the method uses the model itself to approximate optimal reward redistribution without additional generation passes. The paper evaluates RREDCoT against MC sampling and several attribution baselines, analyzing segmentation strategies and state value estimation. This is relevant to the active research thread on improving RL fine-tuning stability and efficiency for reasoning models.