Almanac
Concept guide · Beginner

GRPO: The Lightweight RL Trick Behind Today's Reasoning Models

TL;DR: GRPO (Group Relative Policy Optimization) is a way to teach AI models to reason better by having them try a problem many times, then rewarding the attempts that worked out best. It has become the go-to training recipe for a wide range of tasks because it skips the expensive "critic" model that older approaches required, making it practical on modest hardware. Researchers are now pushing GRPO into new territory (coding agents, multilingual reasoning, robotics, and safety) while also identifying its limits and building smarter variants on top of it.

Key takeaways

  • Hugging Face's TRL library (v1.0) ships GRPO alongside PPO and DPO, making it accessible to any practitioner with a GPU.
  • SCOPE, a self-play framework that uses zero external supervision, matched a GRPO baseline trained on ~9K curated prompts, a sign that GRPO is now the baseline to beat.
  • AXPO found that under standard GRPO training, models only attempt tool use in ~30% of rollouts, exposing a structural gap for agentic tasks.
  • MobileGym used GRPO on Qwen3-VL-4B to achieve +12.8 percentage points on a mobile GUI benchmark, with 95.1% of gains transferring to real devices.
  • Multiple GRPO variants — LamPO, IH-GRPO, VPO, DRPO, POW3R — have emerged to fix specific weaknesses: training instability, sparse rewards, tool-use coherence, and diversity.
  • OpenPipe's ART library (nearly 10K GitHub stars) packages GRPO specifically for multi-step agent training, signaling strong practitioner demand.

What GRPO is

GRPO stands for Group Relative Policy Optimization. In plain terms, it is a training recipe for making AI language models better at tasks that require reasoning — things like solving math problems, writing and debugging code, or deciding which tool to call next.

The core idea is simple: instead of asking the model one question and grading the single answer, you ask it the same question many times, collect a group of attempts, and then reward the ones that worked out best relative to the others in that group. The model's weights are then nudged to make those better attempts more likely in the future.
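The "relative to the others in that group" step can be sketched in a few lines. This is a minimal illustration, assuming the commonly described normalization (subtract the group mean, divide by the group standard deviation); the function name, the `eps` constant, and the use of the population standard deviation are assumptions made for the example, and real implementations may differ in those details.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Score each attempt relative to the other attempts at the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std of the group (an assumption)
    # eps (hypothetical constant) avoids division by zero when all rewards match
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four attempts at one prompt; 1.0 = correct, 0.0 = wrong (made-up rewards).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Attempts that beat the group average get a positive score and are made more likely; attempts below the average get a negative score and are made less likely.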

What makes GRPO stand out is what it doesn't need. Older reinforcement learning methods like PPO require a second AI model, called a "critic", that estimates how good a situation is at every step. Training and running that critic is expensive. GRPO sidesteps this by using the group's own average reward as the yardstick, which cuts memory and compute enough that small models can be trained on a single GPU.
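For readers who like formulas, the contrast can be written down under one common formulation (PPO with generalized advantage estimation, and GRPO with a single outcome reward per attempt). The symbols are assumptions introduced here: $V_\phi$ is the learned critic, $r_i$ is the reward of attempt $i$, and $G$ is the group size.

```latex
% PPO: the advantage depends on a learned critic V_\phi
\hat{A}^{\mathrm{PPO}}_t = \sum_{l \ge 0} (\gamma\lambda)^l \, \delta_{t+l},
\qquad
\delta_t = r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t)

% GRPO: the advantage depends only on the rewards within the group
\hat{A}^{\mathrm{GRPO}}_i =
\frac{r_i - \operatorname{mean}(r_1, \dots, r_G)}
     {\operatorname{std}(r_1, \dots, r_G)}
```

The second expression contains no $V_\phi$, which is the whole saving: there is no second network to train or hold in memory.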

Why you should care

GRPO is the engine behind a wave of "reasoning" AI improvements you may have heard about. When researchers talk about models that can check their own work, back up and try a different approach, or chain together many steps to solve a hard problem, GRPO (or something very close to it) is often what trained that behavior.

A Hugging Face tutorial showed how to reproduce the "aha moment" reasoning behavior seen in DeepSeek R1 using GRPO on a simple countdown game — making the technique accessible to anyone, not just large labs. Since then it has become the community's default starting point for teaching models new skills through reinforcement learning.

How it works (the simple version)

Think of it like a study group. A student (the model) tries a problem ten different ways. The group compares notes: some attempts got the right answer, some didn't. The student learns from the contrast — not from a teacher grading each step, but from seeing which of their own strategies paid off. Over many rounds of this, the student gets better at the kinds of problems the group practiced on.
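The study-group loop can be sketched as a toy program. This is a deliberately simplified illustration, not a full GRPO implementation: the "model" is just three hypothetical strategies with one adjustable score each, only strategy 2 earns a reward, and the update is a plain policy-gradient step without the clipping and KL penalty that GRPO implementations typically add. All names and hyperparameters are made up for the example.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def group_advantages(rewards, eps=1e-6):
    mu = sum(rewards) / len(rewards)
    var = sum((r - mu) ** 2 for r in rewards) / len(rewards)
    return [(r - mu) / (math.sqrt(var) + eps) for r in rewards]

def train(steps=200, group_size=10, lr=0.1, seed=0):
    rng = random.Random(seed)
    logits = [0.0, 0.0, 0.0]  # one score per hypothetical strategy
    correct = 2               # only strategy 2 solves the problem
    for _ in range(steps):
        probs = softmax(logits)
        # The student tries the same problem group_size times.
        attempts = rng.choices(range(3), weights=probs, k=group_size)
        rewards = [1.0 if a == correct else 0.0 for a in attempts]
        # Compare notes: score each attempt against the group.
        advs = group_advantages(rewards)
        # Nudge the policy toward attempts that beat the group average.
        for a, adv in zip(attempts, advs):
            for j in range(3):
                grad = (1.0 if j == a else 0.0) - probs[j]  # d log pi(a) / d logit_j
                logits[j] += lr * adv * grad / group_size
    return softmax(logits)

final_probs = train()
print(final_probs)
```

No teacher grades individual steps here; the only signal is which of the student's own attempts did better than the rest of the group.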

The "reward" that tells GRPO which attempts were good can come from many sources: a simple right/wrong check on a math answer, a score from a code test suite, or a more complex rubric. This flexibility is a big part of why GRPO has spread across so many different tasks.

Where GRPO is being used

The range of applications in recent research is striking:

  • Math and reasoning: Multiple papers use GRPO as the baseline for training models on competition math (AIME benchmarks), with variants like LamPO and SGSD improving on it.
  • Multi-step tool use: The PROVE framework trains models to orchestrate sequences of tool calls using GRPO-style rewards, gaining up to +10.2 points on multi-turn benchmarks.
  • Mobile app control: MobileGym used GRPO to train a vision-language model to navigate phone apps, achieving +12.8 percentage points on a test set — and 95.1% of those gains transferred to real devices.
  • Robotics: Sony and university researchers used GRPO combined with LoRA to fine-tune robot control models with near-zero forgetting of previously learned tasks.
  • Multilingual reasoning: The Luar framework builds on GRPO to teach models when to translate a non-English question into English before answering, with especially large gains on low-resource languages.
  • Safety: AdvGRPO adapts GRPO for red-teaming — jointly training an attacker and a defender — to make models more robust.

Known limits and active fixes

GRPO is not a silver bullet. Researchers have documented several failure modes:

  • Sparse rewards: When the feedback signal is thin or hard to interpret (like empty brackets from a knowledge-graph API), GRPO training can "peak then collapse" — improving briefly before the model stops using the tool entirely.
  • Tool avoidance: One study found that under standard GRPO training, models only attempted tool use in about 30% of rollouts. The AXPO method was designed specifically to fix this.
  • Training instability: GRPO can be unstable when reward signals conflict or when the model's outputs are very long. DRPO and POW3R are two recent proposals that address this with smoother reward weighting.
  • Diversity: VPO (Vector Policy Optimization) argues that GRPO trains models to converge on a single best answer, which hurts performance when you want the model to explore many different solutions at test time.
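One general mechanism behind the sparse-reward problem can be seen directly from the group-relative scoring. This sketch illustrates that mechanism only and does not reproduce any of the cited studies; the function and the reward values are made up, and it assumes the mean/standard-deviation normalization with a small hypothetical `eps`.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

all_failed = [0.0] * 8           # e.g. every rollout got an unhelpful tool result
one_success = [0.0] * 7 + [1.0]  # a single rollout happened to succeed

flat = group_relative_advantages(all_failed)
spiky = group_relative_advantages(one_success)
print(flat)   # every advantage is 0.0: nothing to learn from this group
print(spiky)  # one large positive value, seven small negative ones
```

When every attempt in a group earns the same reward, all advantages are zero and the group contributes no update at all. When successes are rare, the update rests on a single sample, which is one reason training can swing sharply.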

The tooling ecosystem

GRPO has first-class support in several widely used open-source libraries:

  • TRL (Hugging Face) — the most widely used post-training library, now at v1.0, with GRPO alongside PPO and DPO. A recent update adds co-located vLLM inference to eliminate idle GPU time during training.
  • OpenPipe ART — a dedicated library for training multi-step agents with GRPO, with nearly 10,000 GitHub stars.
  • ms-swift (ModelScope) — supports GRPO across 600+ language models and 300+ multimodal models.
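As a rough idea of what getting started with TRL looks like, here is a sketch modeled on the pattern in TRL's GRPO quick start as I understand it. Treat the details as assumptions: argument names can change between TRL versions, the model and dataset identifiers are placeholders, and the reward function is a toy. The TRL imports sit inside `build_trainer` so the reward function can be tried without installing anything.

```python
def brevity_reward(completions, **kwargs):
    """Toy reward: prefer completions close to 20 characters (illustrative only)."""
    return [-abs(20 - len(c)) for c in completions]

def build_trainer():
    """Wire the reward into TRL's GRPO trainer (requires `trl` and `datasets`)."""
    from datasets import load_dataset
    from trl import GRPOConfig, GRPOTrainer

    # Placeholder dataset and model ids; swap in your own.
    dataset = load_dataset("trl-lib/tldr", split="train")
    args = GRPOConfig(
        output_dir="grpo-demo",
        num_generations=8,  # attempts sampled per prompt (the "group")
    )
    return GRPOTrainer(
        model="Qwen/Qwen2-0.5B-Instruct",
        reward_funcs=brevity_reward,
        args=args,
        train_dataset=dataset,
    )

print(brevity_reward(["a" * 20, "too short"]))
# build_trainer().train()  # downloads a model and dataset; expects a GPU
```

The design point to notice is that the only task-specific code is the reward function; the library handles sampling the group, scoring it, and updating the model.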

Where it's heading

GRPO has moved from a research curiosity to the default baseline in a remarkably short time. The current frontier is not whether to use it, but how to fix its rough edges: better reward design for non-verifiable tasks, smarter data selection (SAERL uses model internals to pick better training examples), and hybrid approaches that combine GRPO's efficiency with richer feedback signals like step-by-step critiques. The technique is also spreading beyond text — into vision-language models, robotics, and GUI agents — suggesting its influence will only grow as AI systems take on more complex, multi-step tasks in the real world.

[Figure: How GRPO trains a model]

GRPO vs. related post-training methods

Method | Needs a critic model? | Needs labeled data? | Best known for | Key limitation
GRPO | No | Needs reward signal | Cheap, effective reasoning fine-tuning | Instability with sparse/noisy rewards
PPO | Yes | Needs reward signal | Stable RL with value estimates | More compute-heavy
DPO | No | Yes (preference pairs) | Alignment from human preferences | Offline; no exploration
RLVR | No | Needs verifiable answers | Math / code with exact-match rewards | Fails on non-verifiable tasks

Synthesized from the events bundle; cells marked — where events do not specify.

Timeline

  1. Mini-R1 tutorial reproduces DeepSeek R1 'aha moment' using GRPO, bringing the technique to wide attention

  2. Liger Kernel's GRPO integrates with TRL, targeting memory-efficient training on constrained hardware

  3. TRL adds co-located vLLM support, eliminating idle GPUs during GRPO training loops

  4. TRL v1.0 released with stable API; GRPO listed as a first-class supported method

  5. SCOPE, using no curated data, matches GRPO on open-ended benchmarks, confirming GRPO as the community baseline

Related topics

PPO, DPO, RLVR, TRL, Hugging Face, Qwen3, Qwen3-4B, Qwen3-1.7B, Qwen2.5-7B, Llama, AIME25, OpenVLA-OFT

FAQ

What does GRPO actually do?

It asks the model to attempt the same problem multiple times, then nudges the model's weights to make the better attempts more likely in the future — all without needing a separate 'critic' AI to judge quality.

Why is GRPO popular right now?

It's cheaper than older RL methods like PPO (no critic model to train and run) and more flexible than DPO (it can explore new answers rather than just learning from fixed pairs), making it practical on a single GPU.

What kinds of tasks is GRPO used for?

Recent work applies it to math reasoning, multi-step tool use, mobile app control, multilingual reasoning, robotics, safety training, and even scientific knowledge extraction.

What are GRPO's known weaknesses?

Research points to three main weaknesses: training can be unstable with sparse or noisy reward signals; models trained with it often avoid using tools (only ~30% of rollouts attempt tool calls in one study); and training can 'collapse' on certain API types where feedback is hard to interpret.

How do I run GRPO myself?

Hugging Face's TRL library (v1.0) includes GRPO as a first-class method; OpenPipe's ART library packages it specifically for multi-step agent training; and the ms-swift framework supports it across 600+ models.

More on GRPO (6)

Hugging Face Blog · 1mo ago

Liger GRPO meets TRL: Efficient Reinforcement Learning Training Integration

The Hugging Face blog post announces the integration of Liger Kernel's GRPO (Group Relative Policy Optimization) implementation with TRL (Transformer Reinforcement Learning library). This combination aims to improve memory efficiency and training throughput for RL-based fine-tuning of language models. The integration targets practitioners running GRPO-style training on constrained hardware budgets.

arXiv · cs.CL · 1mo ago

LamPO: Lambda-Style Policy Optimization with Pairwise Decomposed Advantage for Reasoning LMs

LamPO proposes a new RLVR training objective that replaces GRPO's scalar group-relative advantages with a Pairwise Decomposed Advantage, aggregating pairwise reward gaps within response groups and weighting comparisons by confidence-aware log-probability differences. The method retains a critic-free, clipped-update PPO-style structure and optionally adds a ROUGE-L-based dense auxiliary reward to reduce sparsity. Experiments on AIME24, AIME25, MATH-500, and GPQA-Diamond using Qwen3-1.7B, Qwen3-4B, and Phi-4-mini show consistent improvements over GRPO and other RLVR variants with more stable training dynamics.

arXiv · cs.AI · 29d ago

Vector Policy Optimization: Training for Diversity Improves Test-Time Search

Vector Policy Optimization (VPO) is a new RL post-training algorithm for LLMs that replaces the scalar reward paradigm with vector-valued rewards, explicitly training models to produce diverse solution sets that specialize across different reward trade-offs. VPO is designed as a near-drop-in replacement for the GRPO advantage estimator and targets inference-scaling search procedures like AlphaEvolve. Across four tasks, VPO matches or outperforms scalar RL baselines on pass@k and best@k metrics, with advantages growing as search budget increases, and unlocks evolutionary search problems that GRPO-trained models cannot solve. The paper argues that diversity-optimized post-training may need to become the default as inference-time search becomes standard.

GitHub Trending · 28d ago

OpenPipe ART: Agent Reinforcement Trainer for Multi-Step Agents via GRPO

OpenPipe has released ART (Agent Reinforcement Trainer), an open-source Python library for training multi-step agents on real-world tasks using GRPO (Group Relative Policy Optimization). The framework supports multiple model families including Qwen3, GPT-OSS, and Llama. With nearly 10k GitHub stars and 66 gained today, it is gaining notable community traction as a practical RL fine-tuning tool for agentic workflows.

arXiv · cs.CL · 25d ago

Peak-Then-Collapse: RLVR Tool-Use Failures on Knowledge-Graph APIs

This paper investigates RLVR-based tool-use training (GRPO on Qwen2.5-7B-Instruct) on a minimal knowledge-graph API (Freebase over Complex WebQuestions) and documents a 'peak-then-collapse' pattern where tool-grounded answer rates rise then fall to zero within 50 steps, replicated across four seeds and seven reward designs. The authors identify a key structural difference between knowledge-graph APIs and other tool types (Python, web search, JSON): sparse, non-natural-language feedback signals (e.g., empty brackets '[]') prevent the model from recovering via pretraining-familiar error signals. A direct oracle ablation shows relation selection is not the bottleneck—95.4% of errors are retrieval-composition failures—and self-distillation reaches 40% EM at 7B, with capacity scaling to 14B yielding only marginal gains, suggesting an interface-bound ceiling.

arXiv · cs.CL · 25d ago

Signal Collapse and Reward Hacking in Checker-Guided RAG for Biomedical QA

This paper investigates why NLI-based claim checkers used as process rewards in RL-trained medical RAG agents succeed or fail during training. The authors find that a checker's output distribution during training—not its held-out accuracy—determines whether it provides useful gradient signal, with LLM log-probability scoring causing near-total signal collapse (97%+ neutral labels) while a calibrated MedNLI classifier avoids this. A key finding is that stronger checkers can trigger reward hacking cascades (ultra-short answers, search avoidance, language collapse), while moderate-signal local classifiers yield better final model quality (+12% BERTScore over zero-shot). The work frames these as boundary conditions for verifier-as-reward systems in RLVR pipelines.