Concept guide · Beginner

GRPO: The Reinforcement Learning Trick Behind Smarter AI Reasoning


TL;DR

GRPO (Group Relative Policy Optimization) is a reinforcement learning technique that teaches AI models to reason better by comparing a batch of their own answers against each other, rather than relying on a separate "critic" model. It became a popular ingredient in post-training pipelines for reasoning models, but active research is now exposing its limits, from training instability to surprising safety vulnerabilities, and spawning a wave of improvements and alternatives.

Key takeaways

Training with GRPO on a single biased example can override a model's safety guardrails, according to a 2026 arXiv paper, which makes this a significant alignment risk.
Qwen researchers introduced GSPO specifically to fix severe training instability and model collapse observed in GRPO during long training runs.
Adaptive Data Scheduling (ADS) improves average reasoning accuracy by 5.2% over standard GRPO by replacing its uniform data sampling with selection that adapts to the data's structure and the model's evolving ability.
Z.ai's GLM-5.2 — a top open-weights model — switched away from GRPO to PPO for long-horizon training, citing GRPO's limitations at scale.
GRPO is actively used in real-world applications: a building code compliance system (P4IR) using SFT + GRPO outperformed GPT-5.2 and Claude Opus/Sonnet 4.5 in zero-shot settings.

What GRPO is

GRPO stands for Group Relative Policy Optimization. It is a reinforcement learning (RL) technique used to make AI language models better at reasoning tasks — things like math, coding, and multi-step problem solving.

Here's the core idea in plain terms: instead of training a separate "critic" model (a learned judge that estimates how good each answer should be), GRPO has the model generate a group of answers to the same question, scores each one (for example, by checking whether it is correct), and then compares those scores within the group. Answers that score above the group average are reinforced; answers that score below it are discouraged. The model learns from this comparison and gradually improves.
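To make the group comparison concrete, here is a minimal sketch in Python. It assumes the commonly described formulation in which each answer's reward is centered on the group mean and divided by the group's standard deviation; the guide itself does not spell out a formula, so treat the function name `group_relative_advantages` and the small `eps` constant as illustrative choices.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Score each answer relative to the other answers in its group."""
    baseline = mean(rewards)      # the group average plays the role a critic would play
    spread = pstdev(rewards)      # population standard deviation of the group's rewards
    return [(r - baseline) / (spread + eps) for r in rewards]

# Four sampled answers to one question: 1.0 = judged correct, 0.0 = judged incorrect.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)  # positive for the correct answers, negative for the incorrect ones
```

Answers with a positive value are the ones the update would make more likely; answers with a negative value are made less likely.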

This matters because training a separate judge model is expensive and complicated. GRPO sidesteps that cost, making it a practical tool for the "post-training" phase: the stage after a model has been pretrained on large amounts of text, when researchers fine-tune it to reason more carefully.

Why it became popular

GRPO fits neatly into what's called the post-training pipeline: take a capable base model, then use RL to push its reasoning further. Researchers and companies adopted it because it is relatively straightforward to implement and doesn't require the full machinery of older RL methods like PPO (Proximal Policy Optimization).

Real-world use cases have followed. A research team built P4IR, a two-stage system combining standard supervised fine-tuning (SFT) with GRPO, to help AI check whether buildings comply with regulations, a domain where hallucinations carry real legal consequences. Their system outperformed GPT-5.2 and Claude Opus/Sonnet 4.5 in zero-shot settings. Compared with SFT-only baselines, it reduced two error metrics, tree edit distance and token-level Levenshtein distance, by up to 23.8% and 38.6% respectively.

GRPO has also been applied to multimodal models (those that handle both images and text), where a problem called the "thinking-answer gap" has been observed: the model's reasoning trace and its final answer sometimes contradict each other. The CORA (Consistency-Oriented Reasoning Alignment) researchers found that this inconsistency persists throughout GRPO training and inference, and they propose adding a lightweight consistency reward alongside the task reward to address it.

The problems researchers are finding

GRPO's popularity has made its weaknesses more visible.

Training instability. Qwen researchers introduced a new algorithm called GSPO (Group Sequence Policy Optimization) specifically because GRPO suffers from severe instability and even model collapse during long training runs. This acts as a practical ceiling: once training becomes unstable, adding more compute no longer yields better results.
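One way to picture what "sequence-level" means is to compare how each method weights an answer when the model's probabilities have shifted since the answer was sampled. The sketch below is based on our reading of the published GSPO description rather than on anything stated in this guide: it assumes GRPO-style training uses one importance ratio per token, while GSPO uses a single length-normalized ratio for the whole answer. The function names and log-probability values are made up for illustration.

```python
import math

def token_level_ratios(new_logprobs, old_logprobs):
    """One importance ratio per token (token-level weighting)."""
    return [math.exp(n - o) for n, o in zip(new_logprobs, old_logprobs)]

def sequence_level_ratio(new_logprobs, old_logprobs):
    """One length-normalized ratio for the whole answer (sequence-level weighting)."""
    diffs = [n - o for n, o in zip(new_logprobs, old_logprobs)]
    return math.exp(sum(diffs) / len(diffs))

# Hypothetical per-token log-probabilities for a four-token answer.
old = [-1.0, -1.0, -1.0, -1.0]   # under the policy that sampled the answer
new = [-1.0, -0.2, -1.0, -1.8]   # under the policy being updated

print(token_level_ratios(new, old))    # individual tokens swing well above and below 1
print(sequence_level_ratio(new, old))  # the answer as a whole is close to 1
```

In this example the per-token ratios fluctuate widely even though the answer's overall likelihood has barely changed, which is the kind of noise a sequence-level ratio averages out.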

A safety vulnerability. A striking 2026 paper showed that training a model on just one biased example via GRPO is enough to systematically corrupt its alignment, causing stereotype-driven reasoning to generalize across attributes, categories, and benchmarks. The authors also found that susceptibility varies with how likely the model already was to produce biased outputs. This is alarming because it means a minimal, targeted intervention can undo the safety work built into a model.

Reward signal problems. Standard GRPO assigns a single reward to an entire answer, but reasoning models produce long chains of thought. RREDCoT is a new method that redistributes rewards across individual segments of a reasoning chain, giving the model more precise feedback about which parts of its thinking were good or bad.
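The difference between answer-level and segment-level credit can be illustrated with a small sketch. This is not RREDCoT's algorithm: the function names and the hand-picked `weights` are hypothetical, and keeping the total reward unchanged is an assumption made here for illustration. The sketch only shows what it means to spread one answer-level reward unevenly across the segments of a reasoning chain.

```python
def broadcast_reward(answer_reward, segments):
    """Standard setup: every segment sees the same answer-level reward."""
    return [answer_reward for _ in segments]

def redistribute_reward(answer_reward, weights):
    """Hypothetical alternative: split the reward in proportion to per-segment weights.

    The weights are supplied by the caller and must not all be zero.
    Deriving good weights is the hard part and is not modeled here.
    """
    total = sum(weights)
    return [answer_reward * w / total for w in weights]

segments = ["restate the problem", "unnecessary detour", "key algebra step", "final answer"]
weights = [1.0, 0.0, 2.0, 1.0]   # hypothetical usefulness scores, chosen by hand

uniform = broadcast_reward(1.0, segments)
shaped = redistribute_reward(1.0, weights)
for seg, u, s in zip(segments, uniform, shaped):
    print(f"{seg:22s} uniform={u:.2f} redistributed={s:.2f}")
```

With the uniform scheme the detour is credited as much as the key step; with redistribution the detour receives nothing and the key step receives the largest share.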

Redundant exploration. N-GRPO addresses a different issue: when GRPO generates its group of answers, they often end up too similar to each other, reducing the useful learning signal. N-GRPO increases diversity during generation by mixing the embeddings of anchor tokens with those of their nearest semantic neighbors, rather than relying on token-level sampling or random noise.
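The extreme case of redundancy shows why similar answers weaken the learning signal. The sketch below assumes the usual mean-and-standard-deviation normalization of rewards within a group (an assumption, since the guide gives no formula) and uses a hypothetical helper named `advantages`. When every answer in the group earns the same reward, every advantage is zero and the group contributes nothing to the update.

```python
from statistics import mean, pstdev

def advantages(rewards, eps=1e-6):
    """Group-relative advantages: reward minus group mean, scaled by group spread."""
    baseline = mean(rewards)
    spread = pstdev(rewards)
    return [(r - baseline) / (spread + eps) for r in rewards]

diverse_group = [1.0, 0.0, 1.0, 0.0]     # answers differ, so there is something to compare
redundant_group = [1.0, 1.0, 1.0, 1.0]   # near-duplicate answers, identical rewards

print(advantages(diverse_group))    # non-zero values: usable signal
print(advantages(redundant_group))  # all zeros: no signal from this group
```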

Data sampling. Adaptive Data Scheduling (ADS) replaces GRPO's uniform random sampling of training examples with a smarter approach that accounts for the structure of the data and how the model's ability is evolving. Across seven reasoning benchmarks, ADS improved average accuracy by 5.2% over standard GRPO.
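To show what moving away from uniform sampling can look like, here is a generic sketch. It is not the ADS algorithm, which the guide does not describe in detail. It assumes a hypothetical heuristic: weight each training prompt by p * (1 - p), where p is the model's current pass rate on that prompt, so that prompts the model always or never solves are sampled less often. The prompt names, pass rates, and `floor` parameter are invented for illustration.

```python
import random

def uniform_weights(pass_rates):
    """Baseline: every prompt is equally likely to be sampled."""
    return [1.0 / len(pass_rates)] * len(pass_rates)

def signal_weights(pass_rates, floor=0.01):
    """Hypothetical heuristic: favor prompts whose 0/1 reward varies the most."""
    raw = [max(p * (1.0 - p), floor) for p in pass_rates]
    total = sum(raw)
    return [w / total for w in raw]

# Hypothetical current pass rates for four training prompts.
pass_rates = {"easy": 0.95, "medium": 0.50, "hard": 0.05, "solved": 1.00}
names = list(pass_rates)
weights = signal_weights(list(pass_rates.values()))

rng = random.Random(0)                      # fixed seed so the sketch is repeatable
batch = rng.choices(names, weights=weights, k=8)
print(dict(zip(names, (round(w, 3) for w in weights))))
print(batch)
```

Recomputing the pass rates as training progresses would make the weights follow the model's changing ability, which is one of the two ingredients the guide attributes to ADS.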

When labs move on

Z.ai's GLM-5.2 — a 753-billion-parameter open-weights model that leads open-source rankings on agentic coding benchmarks — explicitly switched from GRPO to PPO for its long-horizon RL training. The team cited GRPO's limitations at scale as the reason. This is a signal that for the most demanding training regimes, GRPO's simplicity comes at a cost.

Where things stand

GRPO is not going away. It remains a practical, widely used tool for improving reasoning in language models, and it continues to appear in new research and applications. But the field is clearly in a phase of stress-testing it: finding its failure modes, patching them with add-ons like ADS and RREDCoT, and in some cases replacing it with purpose-built successors like GSPO. The safety vulnerability, in which a single example can break alignment, is the most urgent concern and is likely to drive new guardrails around how GRPO-style training is deployed.

[Figure: GRPO: How the training loop works]
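The loop can be sketched as a toy program. This is a deliberately simplified illustration and not a production GRPO implementation: the "model" is just three numbers (one logit per possible answer to "2 + 2 = ?"), and details found in real training setups, such as clipping and a penalty for drifting from a reference model, are left out. All names (`ANSWERS`, `train`, `group_size`, `lr`) are hypothetical.

```python
import math
import random
from statistics import mean, pstdev

ANSWERS = ["3", "4", "5"]   # toy answer space for the question "2 + 2 = ?"
CORRECT = "4"

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def train(steps=200, group_size=8, lr=0.5, seed=0):
    rng = random.Random(seed)
    logits = [0.0, 0.0, 0.0]   # the toy policy starts with no preference
    for _ in range(steps):
        probs = softmax(logits)
        # 1. Sample a group of answers to the same question.
        group = rng.choices(range(len(ANSWERS)), weights=probs, k=group_size)
        # 2. Score each answer with a checkable reward.
        rewards = [1.0 if ANSWERS[a] == CORRECT else 0.0 for a in group]
        # 3. Turn rewards into group-relative advantages.
        baseline, spread = mean(rewards), pstdev(rewards)
        advs = [(r - baseline) / (spread + 1e-6) for r in rewards]
        # 4. Policy-gradient step: make above-average answers more likely.
        for a, adv in zip(group, advs):
            for k in range(len(logits)):
                grad_logp = (1.0 if k == a else 0.0) - probs[k]
                logits[k] += lr * adv * grad_logp / group_size
    return softmax(logits)

final_probs = train()
print(dict(zip(ANSWERS, final_probs)))
```

After training, most of the probability mass should sit on the correct answer, even though no separate critic model was ever trained.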

GRPO and its main alternatives / improvements

Method	How it scores outputs	Key strength	Known limitation
GRPO	Relative comparison within a group	No separate critic model needed	Instability at scale; safety vulnerability
PPO	Separate critic (value) model	Stable for long-horizon training	More complex and expensive to run
GSPO	Group sequence-level scoring	Addresses GRPO collapse in long runs	Newer; less widely adopted
ADS (wrapper)	Adaptive data sampling over GRPO	+5.2% accuracy vs. standard GRPO	Adds scheduling complexity



FAQ

What makes GRPO different from older RL methods like PPO?

PPO requires a separate critic model to evaluate outputs, which adds complexity and cost. GRPO skips that by comparing a batch of the model's own answers against each other — simpler, but with its own stability trade-offs.

Is GRPO safe to use for fine-tuning?

Research shows a single biased training example fed through GRPO can override a model's safety guardrails, so care is needed — especially when fine-tuning on narrow or user-supplied data.

Is GRPO being replaced?

Not entirely, but alternatives like GSPO (from Qwen) address its instability at scale, and some labs like Z.ai have switched to PPO for long-horizon training. GRPO remains common for shorter, more focused fine-tuning tasks.

What kinds of tasks benefit most from GRPO training?

Math reasoning, coding, and structured problem-solving — tasks where answers can be checked for correctness, giving the model a clear reward signal to learn from.
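A checkable reward can be as simple as comparing the final number in an answer with a reference value. The sketch below is one hypothetical way to do that for math-style questions; the function names and the "last number in the text" rule are assumptions made for illustration, and real pipelines typically use stricter answer formats.

```python
import re

def extract_final_number(answer_text):
    """Return the last number mentioned in the text, or None if there is none."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", answer_text)
    return float(matches[-1]) if matches else None

def correctness_reward(answer_text, expected):
    """1.0 if the final number matches the reference answer, else 0.0."""
    value = extract_final_number(answer_text)
    if value is None:
        return 0.0
    return 1.0 if abs(value - expected) < 1e-9 else 0.0

print(correctness_reward("12 * 3 = 36, so the answer is 36", 36))  # 1.0
print(correctness_reward("I think it is about 35", 36))            # 0.0
print(correctness_reward("No idea", 36))                           # 0.0
```

Because the check is automatic, it can score every answer in a group without a human or a learned judge in the loop.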


More on GRPO (Group Relative Policy Optimization)

arXiv · cs.CL · 21d ago

One-shot GRPO training on a single biased example can break LLM alignment

A new arXiv paper demonstrates that a single biased training example using Group Relative Policy Optimization (GRPO) is sufficient to induce systematic bias in aligned LLMs, with stereotype-driven reasoning generalizing across attributes, categories, and benchmarks. The authors find that model susceptibility varies based on the initial likelihood of producing biased outputs. The result exposes a critical vulnerability in post-training alignment: a minimal fine-tuning intervention can override safety guardrails.


arXiv · cs.CL · 8d ago

P4IR framework uses SFT + GRPO to improve LLM-based automated building code compliance

Researchers introduce P4IR, a two-stage framework combining supervised fine-tuning (SFT) and Group Relative Policy Optimization (GRPO) to improve LLM accuracy in automated code compliance (ACC) for building regulations. The approach reduces tree edit distance and token-level Levenshtein distance by up to 23.8% and 38.6% respectively versus SFT baselines, and outperforms Claude Opus/Sonnet 4.5, GPT-5.2, Qwen-3-Max, and GLM-4.7 in zero-shot settings. The work targets a narrow but practically important domain where LLM hallucinations carry real regulatory consequences.


Qwen Research · 1mo ago

GSPO: Group Sequence Policy Optimization for Scalable RL Training of Language Models

Qwen researchers introduce Group Sequence Policy Optimization (GSPO), a new RL algorithm designed to address severe training instability and model collapse observed in existing methods like GRPO during extended training runs. The core motivation is enabling stable RL scaling for language models to improve reasoning and problem-solving capabilities with increased compute. The paper targets a known bottleneck in post-training pipelines where instability prevents further performance gains.


arXiv · cs.CL · 21d ago

N-GRPO: Semantic Neighbor Mixing for Improved Policy Optimization in LLM Reasoning

A new arXiv preprint introduces N-GRPO, an exploration strategy for the GRPO reinforcement learning framework that improves solution diversity during rollout by mixing embeddings of anchor tokens with their nearest semantic neighbors rather than using token-level sampling or random noise. The method is evaluated on DeepSeek-R1-Distill-Qwen models of various sizes and shows consistent improvements on math reasoning benchmarks plus out-of-distribution generalization. The work targets a known limitation in RLHF-style training: redundant rollout trajectories that reduce effective learning signal.


arXiv · cs.CL · 15d ago

CORA: Consistency-Oriented Reasoning Alignment addresses thinking-answer gap in multimodal RLVR

Researchers identify and analyze a systematic inconsistency between reasoning traces and final answers in RLVR-trained large vision-language models, showing the problem persists throughout GRPO training and inference. They propose CORA, which introduces a lightweight plug-and-play consistency reward model and a Hybrid Reward Advantage Splitting (HRAS) mechanism to coordinate task and consistency optimization. Experiments across multimodal reasoning benchmarks show CORA improves both task performance and reasoning faithfulness.


arXiv · cs.LG · 25d ago

RREDCoT: Segment-level reward redistribution for chain-of-thought reasoning via self-approximated credit assignment

RREDCoT is a new method for redistributing rewards across segments of Chain-of-Thought traces during RL fine-tuning of reasoning language models, addressing the high-variance delayed-reward problem inherent in GRPO-style training. Rather than using computationally expensive Monte Carlo sampling for intermediate state value estimation, the method uses the model itself to approximate optimal reward redistribution without additional generation passes. The paper evaluates RREDCoT against MC sampling and several attribution baselines, analyzing segmentation strategies and state value estimation. This is relevant to the active research thread on improving RL fine-tuning stability and efficiency for reasoning models.
