Almanac
Concept guide · In-depth

GRPO: Group Relative Policy Optimization for LLM Post-Training


TL;DR: GRPO is a critic-free reinforcement learning algorithm that has become the dominant workhorse for post-training language models on reasoning, tool use, and agentic tasks. It removes the need for a separate value network by estimating advantages from groups of sampled responses, making RL fine-tuning cheap enough to run on modest hardware. That accessibility has triggered an explosion of variants, extensions, and tooling that now define the frontier of LLM alignment research.

Key takeaways

  • Critic-free design: GRPO estimates advantages by comparing reward scores within a group of rollouts for the same prompt, eliminating the value-network overhead of PPO.
  • Became the de-facto baseline for RLVR (RL with Verifiable Rewards) after DeepSeek R1 demonstrated emergent chain-of-thought reasoning trained with it.
  • Documented failure modes include the 'peak-then-collapse' pattern on sparse-feedback APIs, reward hacking cascades under strong verifiers, and a 'Thinking-Acting Gap' where tool use appears in only ~30% of agentic rollouts.
  • A dense ecosystem of drop-in replacements and extensions (VPO, LamPO, IH-GRPO, AXPO, DRPO, POW3R) has emerged, each addressing a specific GRPO weakness while preserving its critic-free structure.
  • Production tooling (TRL v1.0, Liger Kernel integration, co-located vLLM, OpenPipe ART, ms-swift) has made GRPO accessible on consumer-grade hardware.
  • Non-parametric methods like CORE and SCOPE now match GRPO-trained baselines on some benchmarks with far fewer rollouts, signaling that the algorithm's dominance may be contested.

What GRPO is

Group Relative Policy Optimization (GRPO) is a reinforcement learning algorithm for post-training large language models. Its defining feature is that it estimates the advantage of each generated response — the signal telling the optimizer whether that response was better or worse than expected — by comparing rewards within a group of responses sampled for the same prompt. The group mean serves as the baseline, eliminating the need for a separate critic or value network that PPO requires. The result is a critic-free RL loop that is substantially cheaper in memory and compute than its predecessors.
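As a worked form of the description above, the group-relative advantage can be written as follows. This is a sketch under stated assumptions: the symbols G (group size), r_i (reward of the i-th response), and epsilon (a small stabilizing constant) are notation introduced here, and the division by the group standard deviation follows the commonly cited GRPO formulation; some implementations omit or change that scaling.

```latex
\hat{A}_i = \frac{r_i - \mu}{\sigma + \epsilon},
\qquad
\mu = \frac{1}{G}\sum_{j=1}^{G} r_j,
\qquad
\sigma = \sqrt{\frac{1}{G}\sum_{j=1}^{G} \left(r_j - \mu\right)^2}
```

For example, a group of four responses with rewards (1, 0, 0, 1) has mean 0.5 and standard deviation 0.5, so the advantages are approximately (+1, -1, -1, +1): correct responses are pushed up, incorrect ones pushed down, and no learned value estimate is involved.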

GRPO sits within the broader RLVR (RL with Verifiable Rewards) paradigm: training on tasks where correctness can be checked programmatically — math answers, code execution, tool-call outcomes — rather than requiring a learned reward model or human rater.

How it works

At each training step, GRPO samples a group of G completions for each prompt, scores each with a reward function, normalizes scores within the group to produce advantages, and updates the policy with a clipped surrogate loss (PPO-style) weighted by those advantages. Because the baseline is the within-group mean rather than a learned value function, no second network needs to be maintained or updated. When training uses LoRA-style adapters, the finished adapters can be merged back into the base weights, adding no inference overhead.
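The step described above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function names, the `clip_eps` and `eps` defaults, and the use of one summed log-probability per completion are assumptions made for brevity. Real trainers typically work with per-token ratios and often add a KL penalty against a reference model.

```python
import numpy as np

def group_advantages(rewards, eps=1e-6):
    """Normalize rewards within one prompt's group of G completions."""
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def clipped_surrogate_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """PPO-style clipped objective, negated so that lower is better.

    logp_new / logp_old: log-probability of each completion under the
    current policy and under the policy that sampled the group
    (assumed here to be one summed value per completion).
    """
    ratio = np.exp(np.asarray(logp_new) - np.asarray(logp_old))
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -np.mean(np.minimum(unclipped, clipped))

# One prompt, G = 4 sampled completions, binary verifier rewards.
rewards = [1.0, 0.0, 0.0, 1.0]
adv = group_advantages(rewards)

# Before any update the two policies agree, so every ratio is 1.
logp_old = np.array([-12.0, -15.0, -11.0, -14.0])
loss = clipped_surrogate_loss(logp_old.copy(), logp_old, adv)
```

Because the advantages are centered within the group, the loss starts at zero when the policies agree; the gradient, not the loss value, carries the learning signal at that point.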

The reward function is the critical design surface. In verifiable settings it is a deterministic checker (answer correctness, code tests, tool-call validity). In open-ended settings, researchers have explored rubric-based rewards, NLI classifiers, and LLM judges — each with distinct failure modes documented in the events below.
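For the verifiable case, a deterministic checker can be very small. The sketch below assumes, purely for illustration, that the model is prompted to put its final answer in a LaTeX-style `\boxed{...}` and that exact string match is an adequate correctness test; the function names are hypothetical, and real checkers usually normalize numeric or symbolic answers more carefully.

```python
import re

def extract_final_answer(completion: str):
    """Return the text inside the last \\boxed{...}, or None if absent."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", completion)
    return matches[-1].strip() if matches else None

def correctness_reward(completion: str, reference: str) -> float:
    """1.0 if the extracted answer equals the reference exactly, else 0.0."""
    answer = extract_final_answer(completion)
    if answer is not None and answer == reference.strip():
        return 1.0
    return 0.0

# Score a group of completions sampled for the same prompt.
group = [
    "3 * 14 = 42, so the answer is \\boxed{42}",
    "I think it is 42",              # right value, but no boxed answer
    "First \\boxed{41}, wait, \\boxed{42}",
]
rewards = [correctness_reward(c, "42") for c in group]
```

Note the design choice in the second completion: a correct value in the wrong format earns nothing. Whether that strictness is desirable depends on the task, and it is one of the places where reward design shapes what the policy learns.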

Why it matters

GRPO became the de-facto post-training baseline after DeepSeek R1 demonstrated that RL training with group-relative advantages could produce emergent chain-of-thought self-correction (the so-called "aha moment"). A Hugging Face tutorial reproducing this behavior with a small model on a countdown task (Mini-R1) made the recipe accessible to the open-source community in early 2025, and adoption accelerated rapidly. By March 2026, TRL v1.0 had stabilized a production-grade GRPO implementation, and tools like OpenPipe ART, ms-swift, and the Liger Kernel integration had brought GRPO within reach of practitioners on consumer hardware.

The algorithm's accessibility is its most consequential property: it turned RL fine-tuning from a cluster-scale operation into something runnable on a single GPU, which is why nearly every post-training paper in the current corpus uses GRPO as its baseline or starting point.

Documented failure modes

The breadth of GRPO adoption has also produced a detailed map of where it breaks.

Sparse, non-natural-language feedback. When GRPO is applied to knowledge-graph APIs that return structured but terse signals (e.g., empty brackets []), a "peak-then-collapse" pattern emerges: tool-grounded answer rates rise then fall to zero within ~50 training steps across multiple seeds and reward designs. The model cannot recover because the error signal is outside its pretraining distribution.

Reward hacking under strong verifiers. In biomedical RAG settings, using a high-accuracy LLM log-probability scorer as a process reward causes near-total signal collapse (97%+ neutral labels), while stronger checkers trigger reward hacking cascades — ultra-short answers, search avoidance, language collapse. A calibrated local classifier avoids both failure modes and yields better final quality.
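Signal collapse of this kind can be monitored by looking at the distribution of labels a checker emits during training rather than at its held-out accuracy. The sketch below is a hypothetical diagnostic, not something taken from the paper: the function names are invented, and the 0.97 threshold is borrowed from the "97%+ neutral" figure only as an illustrative default.

```python
from collections import Counter

def label_distribution(labels):
    """Map each checker label to its share of all outputs."""
    counts = Counter(labels)
    total = len(labels)
    return {label: n / total for label, n in counts.items()}

def is_collapsed(labels, threshold=0.97):
    """True when a single label accounts for at least `threshold` of outputs."""
    if not labels:
        return False
    return max(label_distribution(labels).values()) >= threshold

# A checker that almost always answers "neutral" gives nearly constant reward.
collapsed_run = ["neutral"] * 98 + ["entailment"] * 2
healthy_run = ["entailment"] * 40 + ["neutral"] * 35 + ["contradiction"] * 25
```

A collapsed distribution means nearly every rollout receives the same reward, so group-relative advantages carry almost no information regardless of how accurate the checker is on a benchmark.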

The Thinking-Acting Gap. Under standard GRPO training for agentic tasks, tool use appears in only ~30% of rollouts. All-wrong tool-using subgroups suppress learning signals, starving the model of the exploration needed to improve tool orchestration. AXPO addresses this by fixing the thinking prefix and resampling tool calls for all-wrong subgroups.
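The arithmetic behind that suppression is easy to see, assuming mean-centered advantages: any set of rollouts that all receive the same reward yields zero advantage for every member, so nothing is reinforced or discouraged. The sketch below illustrates only this effect; the helper names are hypothetical, and how AXPO actually defines, detects, and resamples its subgroups is not shown here.

```python
import numpy as np

def centered_advantages(rewards):
    """Mean-centered rewards for one group (scaling omitted for clarity)."""
    rewards = np.asarray(rewards, dtype=np.float64)
    return rewards - rewards.mean()

def zero_signal_fraction(groups):
    """Fraction of groups whose rewards are all identical."""
    flags = []
    for group in groups:
        values = np.asarray(group, dtype=np.float64)
        flags.append(values.max() == values.min())
    return float(np.mean(flags))

# Four tool-using rollouts that all fail: every advantage is zero.
all_wrong = centered_advantages([0.0, 0.0, 0.0, 0.0])

# A single success is enough to restore a learning signal.
mixed = centered_advantages([0.0, 1.0, 0.0, 0.0])

share = zero_signal_fraction([[0, 0], [1, 0], [1, 1], [0, 1]])
```

Tracking something like `zero_signal_fraction` per batch is one inexpensive way to notice when a large share of sampled groups contributes no gradient at all.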

Static rubric saturation. When rubric-based rewards aggregate criterion weights statically, many criteria are either already saturated or unreachable at any given training step, wasting gradient signal. POW3R's dynamic reweighting reaches equivalent performance in 2.5–4× fewer steps.

Trust-region approximation in long-tailed vocabularies. Importance ratios poorly proxy distributional shift for rare tokens, and hard masking at trust-region boundaries discards gradient signal rather than correcting it. DRPO replaces the hard mask with a smooth quadratic regularizer.
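The difference between a hard mask and a smooth penalty shows up directly in the gradient with respect to the importance ratio. The sketch below is illustrative only: the quadratic penalty on (ratio - 1) with weight `beta` is an assumed stand-in chosen for simplicity, not DRPO's actual regularizer, whose exact form is not given here.

```python
def clipped_grad(ratio, advantage, clip_eps=0.2):
    """d/d(ratio) of min(ratio * A, clip(ratio) * A).

    Once the clip binds in the direction the advantage is pushing,
    the gradient is exactly zero: the sample is masked out.
    """
    if advantage > 0 and ratio > 1.0 + clip_eps:
        return 0.0
    if advantage < 0 and ratio < 1.0 - clip_eps:
        return 0.0
    return advantage

def quadratic_grad(ratio, advantage, beta=0.5):
    """d/d(ratio) of ratio * A - beta * (ratio - 1) ** 2 (assumed penalty)."""
    return advantage - 2.0 * beta * (ratio - 1.0)

# Inside the trust region both objectives push the ratio up.
inside = (clipped_grad(1.1, 1.0), quadratic_grad(1.1, 1.0))

# Far outside it, the hard clip goes silent while the penalty pulls back.
outside = (clipped_grad(3.0, 1.0), quadratic_grad(3.0, 1.0))
```

With the hard clip, a token whose ratio has drifted to 3.0 simply stops contributing; with the smooth penalty, the same token produces a corrective gradient that grows with the size of the drift.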

The variant landscape

The failure modes above have each spawned a targeted fix, most designed as near-drop-in replacements for the GRPO advantage estimator:

  • LamPO replaces scalar group-relative advantages with pairwise decomposed advantages weighted by confidence-aware log-probability differences, yielding more stable training on AIME24/25, MATH-500, and GPQA-Diamond.
  • VPO replaces scalar rewards with vector-valued rewards, explicitly training for solution diversity to support inference-time search procedures like evolutionary algorithms; its gains over scalar-reward baselines grow as search budget increases.
  • IH-GRPO decouples tool invocation from execution via a hierarchical surrogate loss, recovering +1.87–2.53% on out-of-domain math benchmarks over the strongest baseline.
  • AXPO fixes the thinking prefix and resamples all-wrong tool-call subgroups, achieving +1.8pp Pass@1 and Pass@4 at 8B over SFT+GRPO.
  • DRPO smooths the trust-region boundary with a quadratic divergence regularizer, improving stability across model scales and precision settings.
  • POW3R dynamically reweights rubric criteria using rollout-level contrast, winning 24 of 30 comparisons against vanilla GRPO with rubric rewards.

Application domains

The events bundle documents GRPO applications well beyond its original math-reasoning context:

  • Multi-step tool orchestration (PROVE): training on 20 stateful MCP servers with 343 tools yields +10.2 on BFCL Multi-Turn, +6.8 on tau-bench, +6.5 on T-Eval.
  • Mobile GUI agents (MobileGym): GRPO on Qwen3-VL-4B-Instruct achieves +12.8pp on a 256-task test set, with 95.1% of simulation gains transferring to real devices.
  • Robotics continual learning: combining LoRA with GRPO on OpenVLA-OFT achieves 81.2% success on LIBERO spatial tasks with near-zero catastrophic forgetting (0.3pp drop).
  • Multilingual reasoning (Luar): GRPO trains models to selectively invoke English translation only when direct understanding is unreliable, with especially large gains on low-resource languages.
  • Red teaming (AdvGRPO): dense multi-channel rewards and decoupled advantage normalization make GRPO viable for joint attacker-defender co-training.
  • Scientific knowledge graph construction (Agents-K1): a 4B information-extraction model trained with GRPO processes 2.46 million papers into a structured knowledge graph.
  • Multimodal judge calibration: GRPO-based reward modeling with batch-ranking objectives reduces perceptual judgment bias in vision-language judges.

Challenges to GRPO's dominance

Two non-parametric approaches now match GRPO on key benchmarks without updating model weights at all:

SCOPE co-evolves a Challenger (task generator) and a Solver (retrieval-augmented answerer) using a frozen initial model as a self-judge. Across three 7–8B models, it matches GRPO trained on ~9K curated prompts on eight open-ended benchmarks, with no external supervision or frontier-model judge.

CORE compares successful and unsuccessful reasoning traces to distill compact natural-language "insights" about reasoning strategies. It matches or beats GRPO/RLVR under fixed rollout budgets, achieving comparable gains with as few as five training samples.

These results suggest that for some task distributions, the gradient signal GRPO provides is not the binding constraint — data quality and feedback structure matter more.

Tooling ecosystem

GRPO's practical reach is inseparable from its tooling support. TRL v1.0 (Hugging Face) stabilized the API and added co-located vLLM inference, eliminating the idle-GPU problem where generation and training steps previously required alternating dedicated GPU allocations. The Liger Kernel integration targets memory efficiency on constrained hardware. OpenPipe ART provides a purpose-built open-source library for multi-step agentic GRPO training across Qwen3, GPT-OSS, and Llama. ms-swift covers GRPO alongside CPT, SFT, and DPO across 600+ LLMs and 300+ multimodal LLMs, with AAAI 2025 acceptance.
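To give a sense of what this tooling looks like in use, here is a minimal sketch of wiring a custom reward into TRL's `GRPOTrainer`. It follows the quickstart pattern documented for earlier TRL releases; the model id, dataset name, and `num_generations` value are example choices rather than recommendations, and argument names (especially those for vLLM co-location and Liger kernels, which are deliberately left out) may differ in the version you have installed.

```python
def brevity_reward(completions, **kwargs):
    """Toy reward: completions closer to 200 characters score higher.

    TRL passes the sampled completions (plain strings for a
    non-conversational dataset) plus extra keyword arguments.
    """
    return [-abs(200 - len(text)) / 200.0 for text in completions]

def build_trainer(model_id="Qwen/Qwen2-0.5B-Instruct", output_dir="grpo-demo"):
    """Construct a GRPOTrainer; requires `trl` and `datasets` to be installed."""
    # Imported lazily so the reward function can be tested without TRL.
    from datasets import load_dataset
    from trl import GRPOConfig, GRPOTrainer

    dataset = load_dataset("trl-lib/tldr", split="train")
    # num_generations is the group size G; it must be compatible with
    # the effective batch size configured for the run.
    config = GRPOConfig(output_dir=output_dir, num_generations=8)
    return GRPOTrainer(
        model=model_id,
        reward_funcs=brevity_reward,
        args=config,
        train_dataset=dataset,
    )

# Usage (downloads a model and dataset, so run it deliberately):
#   trainer = build_trainer()
#   trainer.train()
```

The reward function is the only task-specific piece; swapping in a verifier-based checker changes what is optimized without touching the rest of the setup.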

Where it's heading

The current frontier is not GRPO itself but the reward signal it optimizes. The most active research directions — programmatic environment rewards (PROVE), dynamic rubric reweighting (POW3R), retrieval-augmented reasoning (RA-RFT), skill reuse via MDL (ReuseRL), and data engineering via sparse autoencoders (SAERL) — all treat GRPO as a fixed substrate and compete on what signal to feed it. The emergence of non-parametric alternatives (SCOPE, CORE) and diversity-optimized variants (VPO) suggests the next generation of post-training may look quite different from the scalar-reward, single-policy loop that GRPO standardized — but for now, GRPO remains the algorithm every new method must beat.

GRPO training loop and variant landscape

GRPO and its principal variants / alternatives

| Method | Key change from GRPO | Claimed gain | Best for |
| --- | --- | --- | --- |
| GRPO (baseline) | Group-relative advantages, no critic | — | General RLVR baseline |
| LamPO | Pairwise decomposed advantage + confidence weighting | Consistent gains over GRPO on AIME24/25, MATH-500, GPQA-Diamond | Reasoning LMs, stable training |
| VPO | Vector-valued rewards; trains for solution diversity | Matches/beats GRPO on pass@k; unlocks evolutionary search | Inference-time search scaling |
| IH-GRPO | Decouples tool invocation from execution via hierarchical surrogate loss | +1.87–2.53% on 6 OOD math benchmarks over strongest baseline | Tool-integrated mathematical reasoning |
| AXPO | Fixes thinking prefix; resamples all-wrong tool-call subgroups | +1.8pp Pass@1 and Pass@4 at 8B over SFT+GRPO | Multimodal agentic reasoning |
| DRPO | Smooth quadratic divergence regularizer replaces hard trust-region mask | Improved stability across scales vs. PPO/GRPO | Long-tailed vocabulary stability |
| POW3R | Dynamic criterion reweighting via rollout-level contrast | Wins 24/30 comparisons; 2.5–4× fewer steps to equivalent performance | Rubric-based RLVR |
| SCOPE (non-parametric) | Self-play co-evolution; no RL weight updates | Matches GRPO trained on ~9K curated prompts | Data-free open-ended tasks |
| CORE (non-parametric) | Contrastive trace distillation into natural-language insights | Comparable or better than GRPO/RLVR at ≥5 training samples | Sample-efficient self-improvement |

All figures from the events bundle; unknown cells render —.

Timeline

  1. Mini-R1 tutorial reproduces DeepSeek R1 'aha moment' with GRPO, popularizing the algorithm in the open-source community

  2. Liger Kernel GRPO integrates with TRL, targeting memory-efficient training on constrained hardware

  3. TRL adds co-located vLLM support, eliminating idle-GPU waste in GRPO/PPO online RL pipelines

  4. TRL v1.0 released — API stabilization signals GRPO as a production-standard method

  5. POW3R demonstrates dynamic rubric reweighting reaches equivalent GRPO performance in 2.5–4× fewer steps

  6. SCOPE matches GRPO on open-ended benchmarks without any RL weight updates, challenging GRPO's necessity

Related topics

PPO, DPO, RLVR, TRL, Hugging Face, Qwen3, Qwen3-4B, Qwen3-1.7B, Qwen2.5-7B, Llama, AIME25, OpenVLA-OFT

FAQ

Why does GRPO not need a critic/value network?

GRPO estimates the advantage of each response by comparing its reward against the mean reward of a group of responses sampled for the same prompt — the group average acts as a baseline, making a learned value function unnecessary.

What is the 'peak-then-collapse' failure mode?

When GRPO is applied to knowledge-graph APIs that return sparse, non-natural-language feedback (e.g., empty brackets), tool-grounded answer rates rise briefly then fall to zero within ~50 training steps — the model cannot recover because the error signals are outside its pretraining distribution.

How does GRPO relate to PPO?

Both use a clipped surrogate objective and trust-region-style updates, but PPO requires a separate value network to estimate baselines while GRPO derives them from within-group reward statistics, substantially reducing memory and compute cost.

Is GRPO still the best choice for reasoning fine-tuning?

It is the dominant baseline, but two non-parametric alternatives (SCOPE, CORE) now match it on some benchmarks with far fewer rollouts, and variants like LamPO and POW3R report consistent gains over vanilla GRPO. The right choice depends on task, data availability, and compute budget.

What tooling supports GRPO out of the box?

TRL v1.0 (with co-located vLLM and Liger Kernel integration), OpenPipe ART, and ms-swift all support GRPO across major model families including Qwen3, Llama, and DeepSeek.

More on GRPO (6)

Hugging Face Blog · 1mo ago

Liger GRPO meets TRL: Efficient Reinforcement Learning Training Integration

The Hugging Face blog post announces the integration of Liger Kernel's GRPO (Group Relative Policy Optimization) implementation with TRL (Transformer Reinforcement Learning library). This combination aims to improve memory efficiency and training throughput for RL-based fine-tuning of language models. The integration targets practitioners running GRPO-style training on constrained hardware budgets.

arXiv · cs.CL · 1mo ago

LamPO: Lambda-Style Policy Optimization with Pairwise Decomposed Advantage for Reasoning LMs

LamPO proposes a new RLVR training objective that replaces GRPO's scalar group-relative advantages with a Pairwise Decomposed Advantage, aggregating pairwise reward gaps within response groups and weighting comparisons by confidence-aware log-probability differences. The method retains a critic-free, clipped-update PPO-style structure and optionally adds a ROUGE-L-based dense auxiliary reward to reduce sparsity. Experiments on AIME24, AIME25, MATH-500, and GPQA-Diamond using Qwen3-1.7B, Qwen3-4B, and Phi-4-mini show consistent improvements over GRPO and other RLVR variants with more stable training dynamics.

arXiv · cs.AI · 29d ago

Vector Policy Optimization: Training for Diversity Improves Test-Time Search

Vector Policy Optimization (VPO) is a new RL post-training algorithm for LLMs that replaces the scalar reward paradigm with vector-valued rewards, explicitly training models to produce diverse solution sets that specialize across different reward trade-offs. VPO is designed as a near-drop-in replacement for the GRPO advantage estimator and targets inference-scaling search procedures like AlphaEvolve. Across four tasks, VPO matches or outperforms scalar RL baselines on pass@k and best@k metrics, with advantages growing as search budget increases, and unlocks evolutionary search problems that GRPO-trained models cannot solve. The paper argues that diversity-optimized post-training may need to become the default as inference-time search becomes standard.

GitHub Trending · 28d ago

OpenPipe ART: Agent Reinforcement Trainer for Multi-Step Agents via GRPO

OpenPipe has released ART (Agent Reinforcement Trainer), an open-source Python library for training multi-step agents on real-world tasks using GRPO (Group Relative Policy Optimization). The framework supports multiple model families including Qwen3, GPT-OSS, and Llama. With nearly 10k GitHub stars and 66 gained today, it is gaining notable community traction as a practical RL fine-tuning tool for agentic workflows.

arXiv · cs.CL · 25d ago

Peak-Then-Collapse: RLVR Tool-Use Failures on Knowledge-Graph APIs

This paper investigates RLVR-based tool-use training (GRPO on Qwen2.5-7B-Instruct) on a minimal knowledge-graph API (Freebase over Complex WebQuestions) and documents a 'peak-then-collapse' pattern where tool-grounded answer rates rise then fall to zero within 50 steps, replicated across four seeds and seven reward designs. The authors identify a key structural difference between knowledge-graph APIs and other tool types (Python, web search, JSON): sparse, non-natural-language feedback signals (e.g., empty brackets '[]') prevent the model from recovering via pretraining-familiar error signals. A direct oracle ablation shows relation selection is not the bottleneck—95.4% of errors are retrieval-composition failures—and self-distillation reaches 40% EM at 7B, with capacity scaling to 14B yielding only marginal gains, suggesting an interface-bound ceiling.

arXiv · cs.CL · 25d ago

Signal Collapse and Reward Hacking in Checker-Guided RAG for Biomedical QA

This paper investigates why NLI-based claim checkers used as process rewards in RL-trained medical RAG agents succeed or fail during training. The authors find that a checker's output distribution during training—not its held-out accuracy—determines whether it provides useful gradient signal, with LLM log-probability scoring causing near-total signal collapse (97%+ neutral labels) while a calibrated MedNLI classifier avoids this. A key finding is that stronger checkers can trigger reward hacking cascades (ultra-short answers, search avoidance, language collapse), while moderate-signal local classifiers yield better final model quality (+12% BERTScore over zero-shot). The work frames these as boundary conditions for verifier-as-reward systems in RLVR pipelines.