GRPO (Group Relative Policy Optimization)

technique · active · grpo-group-relative-policy-optimization--5937d22d · 17 events · first seen May 18, 2026

Aliases: GRPO (Group Relative Policy Optimization), Group Relative Policy Optimization

More like this (12)

GRPO · GSPO (Group Sequence Policy Optimization) · Group-in-Group Policy Optimization · GraphGPO · AdvGRPO · N-GRPO · Off-Context GRPO · Latent-Anchored GRPO · Flow-GRPO · GraphPO: Graph-based Policy Optimization for Reasoning Models · IH-GRPO · Hierarchical Relative Policy Optimization

Guides (1)

GRPO (Group Relative Policy Optimization) · Concept

GRPO: The Reinforcement Learning Trick Behind Smarter AI Reasoning

Recent events (17)

6 · arXiv · cs.LG · 32h ago · source ↗

APO: Unsupervised atomic policy optimization for 3D structure prediction outperforms supervised baselines

Researchers introduce Atomic Policy Optimization (APO), an unsupervised alignment framework for predicting 3D structures of atomic systems (crystals, antibodies) without requiring ground-truth coordinate labels. APO adapts group-relative policy optimization to 3D atomic environments using a dual-reward mechanism: an eigen-decomposition-based reward reinforcing dominant latent structural modes, and a thermodynamic stability reward. Benchmarks on crystal and antibody structure prediction show APO surpasses fully supervised baselines on match rates and structural fidelity while also improving inference efficiency by straightening probability paths. The work is significant for materials science and drug discovery applications where experimental labels are scarce or prohibitively expensive.

Alignment and RLHF · FlowDPO · APO: Unsupervised Atomic Policy Optimization for 3D Structure Prediction of Atomic Systems · GRPO (Group Relative Policy Optimization) · +1 more

5 · arXiv · cs.AI · 3d ago · source ↗

LLM framework for instance-wise OR formulation selection improves multi-warehouse inventory allocation at JD.com

Researchers propose a solver-guided LLM framework that selects among a library of MIP formulations for multi-warehouse inventory allocation on a per-instance basis, rather than applying a single fixed formulation. The system is trained via SFT, IPO preference optimization, and GRPO reinforcement learning using MIP solver evaluations as reward signals. Evaluated on real JD.com data, GRPO raises Hit Ratio@1 from 21.45% to 50.42% and achieves a 12.57 percentage point allocation accuracy gain over the incumbent baseline. The work demonstrates a practical pattern of using LLMs as meta-selectors over classical OR solvers in industrial logistics settings.

Enterprise Deployment Patterns · Agent and Tool Ecosystem · Large Language Model for Operations Research Formulation Selection in Multi-Warehouse Inventory Allocation · Identity Preference Optimization · GRPO (Group Relative Policy Optimization) · +1 more

5 · arXiv · cs.CL · Jul 23, 2026 · source ↗

PyroDash: Token-level SLM-LLM collaborative inference with cost-aware policy learning

PyroDash is a framework that trains a small language model to emit control tokens during generation, triggering selective handoffs to a large LLM only when needed. The SLM policy is learned via a three-stage process culminating in Group Relative Policy Optimization with a reward balancing accuracy against normalized inference cost. On five mathematical reasoning benchmarks, the system achieves accuracy above LLM-only baselines while reducing cost by 20%+ at moderate settings, or cuts LLM calls to 0.012 per example at aggressive cost targets. The approach requires no separate router, LLM retraining, or access to LLM logits.
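One way to picture a reward that balances accuracy against normalized inference cost is the sketch below. It is an illustration only: the function name `cost_aware_reward`, the `cost_weight` parameter, and the choice to normalize cost as the share of generated tokens that came from the large model are all assumptions, not details taken from the PyroDash paper.

```python
def cost_aware_reward(correct, llm_tokens, total_tokens, cost_weight=0.5):
    """Hypothetical reward: task accuracy minus a weighted, normalized cost.

    correct      -- whether the final answer was right
    llm_tokens   -- tokens generated by the large model (assumed cost proxy)
    total_tokens -- all tokens generated for this example
    cost_weight  -- assumed knob trading accuracy against cost
    """
    if total_tokens <= 0:
        raise ValueError("total_tokens must be positive")
    # Normalizing keeps the cost term in [0, 1] regardless of response length.
    normalized_cost = llm_tokens / total_tokens
    return float(correct) - cost_weight * normalized_cost


# A correct answer with no handoff keeps the full accuracy reward.
slm_only = cost_aware_reward(True, llm_tokens=0, total_tokens=100)
# A correct answer produced entirely by the large model is penalized.
llm_only = cost_aware_reward(True, llm_tokens=100, total_tokens=100)
# A wrong answer that still spent large-model tokens scores below zero.
wasted = cost_aware_reward(False, llm_tokens=50, total_tokens=100)
```

Under these assumptions, raising `cost_weight` pushes the policy toward fewer handoffs, which is the kind of trade-off the "moderate" and "aggressive" cost settings in the summary suggest.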

Inference Economics · Agent and Tool Ecosystem · GRPO (Group Relative Policy Optimization) · PyroDash

5 · arXiv · cs.CL · Jul 22, 2026 · source ↗

MedDDC-Eval: Diagnosis-Decoupled Evaluation Framework for Multi-Turn Medical Consultation Agents

Researchers introduce MedDDC-Eval, a benchmark framework that decouples history-elicitation quality from diagnosis generation in multi-turn medical consultation agents, using a shared frozen reader to hold the history-to-diagnosis mapping constant. The paper demonstrates that varying only the diagnostic reader shifts diagnosis F1 by 2.2–19.0 points and reverses 18–36% of pairwise policy orderings, exposing confounds in coupled evaluation. The authors also apply Group Relative Policy Optimization (GRPO) to post-train Qwen3-32B using diagnosis-result and trajectory feedback, achieving 9.7 and 4.6 total-score point improvements on two evaluation splits. The work addresses a methodological gap in evaluating medical dialogue agents by enabling controlled attribution of policy quality.

Evaluation and Benchmarking · Agent and Tool Ecosystem · Qwen3 32B · GRPO (Group Relative Policy Optimization) · MedDDC-Eval

5 · arXiv · cs.CL · Jul 21, 2026 · source ↗

DeLIVeR: Reinforced knowledge graph exploration for LLM fact-checking

DeLIVeR is a new framework for automated fact-checking that decomposes complex claims into targeted questions and traverses structured Knowledge Graphs for evidence retrieval, optimizing a Planner LLM via Group Relative Policy Optimization (GRPO). Evaluated on LIAR, FEVER, and PolitiFact benchmarks using Qwen2.5-7B, the system achieves F1-scores of 83.73, 84.57, and 79.70 respectively, representing a 10-15% improvement over HippoRAG2. The approach addresses 'query brittleness' in traditional retrieval by framing evidence gathering as a reinforced strategic exploration task, yielding auditable reasoning paths.

Evaluation and Benchmarking · Agent and Tool Ecosystem · LIAR · Qwen2.5-7B · PolitiFact · +4 more

4 · arXiv · cs.CL · Jul 20, 2026 · source ↗

ToolSciVer: Tool-augmented reinforcement learning for multimodal scientific claim verification

Researchers introduce ToolSciVer, a framework that equips vision-language models with three type-aware visual tools (table focus, chart-to-structure parsing, high-resolution zoom) to verify scientific claims grounded in figures, tables, and charts from papers. The policy is trained using Group Relative Policy Optimization (GRPO) with a composite reward covering correctness, format, tool-use efficiency, and validity. Experiments across five VLMs from three model families (Qwen, InternVL, Gemma) on SciVer and MuSciClaims benchmarks show improvements over prompting-based and RL-based baselines. The work is notable as the first tool-augmented framework specifically targeting multimodal scientific claim verification.

Evaluation and Benchmarking · Agent and Tool Ecosystem · InternVL · MuSciClaims · Gemma · +5 more

6 · arXiv · cs.CL · Jul 15, 2026 · source ↗

GRPO fails to improve small web agents when supervised baseline is near-ceiling, study finds

A controlled ablation study across 18 runs tests whether GRPO reinforcement learning adds capability to 4B–8B scale language and vision-language model web agents on top of a strong supervised baseline. The result is a credible null: GRPO does not improve success rates when the supervised model has largely mastered the task distribution, and moderate-to-high learning rates actively degrade text-track performance. The authors identify the mechanism: GRPO helps only when sampling headroom exists (the sampled policy succeeds more often than greedy decoding). Failure modes split into two regimes, attention/MLP degradation and full collapse. Effective rank in late layers tracks capability at 4B but not at 8B, flagging a scale-dependent coupling.
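The "sampling headroom" condition can be made concrete with a small diagnostic. The sketch below assumes one particular reading of it, namely the gap between the fraction of tasks solved by at least one sampled rollout and the fraction solved by greedy decoding; the function names and this exact definition are assumptions, and the study may measure headroom differently.

```python
def success_rate(flags):
    """Fraction of True entries in a non-empty list of booleans."""
    return sum(1 for f in flags if f) / len(flags)


def sampling_headroom(greedy_success, sampled_success):
    """Hypothetical headroom estimate.

    greedy_success  -- one bool per task: did greedy decoding succeed?
    sampled_success -- one list of bools per task: outcome of each sample
    Returns (any-sample success rate) - (greedy success rate).
    """
    if len(greedy_success) != len(sampled_success):
        raise ValueError("need one sample group per task")
    any_sample = [any(group) for group in sampled_success]
    return success_rate(any_sample) - success_rate(greedy_success)


# Four tasks, two samples each. Greedy solves only the first task,
# while sampling also finds a success on the second.
greedy = [True, False, False, False]
samples = [[True, True], [False, True], [False, False], [False, False]]
headroom = sampling_headroom(greedy, samples)
```

A headroom near zero would mean sampling rarely finds successes that greedy decoding misses, which matches the near-ceiling setting in which the study reports no gain from GRPO.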

Evaluation and Benchmarking · Agent and Tool Ecosystem · A Learning-Rate-Gated Failure of GRPO in a Small Language and Vision-Language Model Web Agent · GRPO (Group Relative Policy Optimization) · Set-of-Marks · +1 more

5 · arXiv · cs.CL · Jul 10, 2026 · source ↗

GRPO outperforms SFT for ASR adaptation using only synthetic speech

A new arXiv preprint demonstrates that Group Relative Policy Optimization (GRPO) substantially outperforms supervised fine-tuning (SFT) when adapting LLM-based automatic speech recognition to regulated domains using only synthetic TTS data. GRPO alone reduces word error rate by 40% relative to SFT (36.71% → 22.09%), with an SFT+GRPO combination achieving 45% relative reduction. The authors attribute gains to behavioral changes — improved stopping calibration and better audio-text alignment — rather than representational shifts in early layers.
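The 40% figure is a relative reduction, not a drop in percentage points, and it can be checked from the two word error rates quoted above. The helper name below is illustrative.

```python
def relative_reduction(before, after):
    """Relative reduction of a metric, as a fraction of its starting value."""
    return (before - after) / before


sft_wer = 36.71   # word error rate after SFT, in percent (from the summary)
grpo_wer = 22.09  # word error rate after GRPO alone, in percent

absolute_points = sft_wer - grpo_wer              # drop in percentage points
relative = relative_reduction(sft_wer, grpo_wer)  # drop as a share of the SFT WER

print(f"absolute: {absolute_points:.2f} points, relative: {relative:.1%}")
```

The absolute drop is about 14.6 percentage points, which is roughly 40% of the SFT error rate.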

Evaluation and Benchmarking · Alignment and RLHF · GRPO (Group Relative Policy Optimization) · Better Call GRPO

6 · arXiv · cs.CL · Jul 9, 2026 · source ↗

AdaPrefix-GRPO: Adaptive prefix control doubles GRPO accuracy on hard math reasoning

A new arXiv preprint introduces AdaPrefix-GRPO, a method that addresses GRPO's failure to learn from problems where no rollout succeeds. It prepends correct solution prefixes and dynamically adjusts prefix length to maintain a ~50% success rate per problem throughout training. The prefix assistance is gradually withdrawn so the final model solves problems unaided. On hard math benchmarks, the method more than doubles GRPO accuracy for a 0.6B model (2.1x) and achieves a 1.7x improvement on AIME while halving trace length, with larger gains on smaller models. The implementation requires only data preparation changes and a loss mask, leaving the trainer unchanged.
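Why a problem with no successful rollout gives GRPO nothing to learn from can be seen in the group-relative advantage itself. The sketch below uses the commonly cited formulation, in which each rollout's reward is standardized against the mean and standard deviation of its own group. The function name and the `eps` stabilizer are assumptions for illustration, and real trainers differ in details.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize each rollout's reward against its own group.

    rewards -- scalar rewards for all rollouts sampled from one prompt
    eps     -- assumed small constant to avoid division by zero
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Mixed outcomes: successes get positive advantages, failures negative ones.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# Every rollout fails: all rewards equal the group mean, so every
# advantage is zero and the policy gradient for this prompt vanishes.
all_fail = group_relative_advantages([0.0, 0.0, 0.0, 0.0])
```

For binary rewards the group variance is p(1 - p), which peaks at a success rate of p = 0.5. That is one plausible reading of why the method targets roughly 50% success per problem.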

Evaluation and Benchmarking · Alignment and RLHF · AIME · Max Out GRPO Signal: Adaptive Trace Prefix Control for Hard Reasoning Problems · Qwen3-1.7B · +2 more

7 · The Batch · Jun 26, 2026 · source ↗

Z.ai releases GLM-5.2, a 753B MoE open-weights model claiming top open-model ranking on agentic coding benchmarks

Z.ai released GLM-5.2, a 753-billion-parameter mixture-of-experts open-weights model optimized for long-running agentic coding tasks, with a 1-million-token input context and MIT license. The model ranks first among open-weights models on Artificial Analysis's Intelligence Index v4.1 (score 51, behind Claude Opus 4.8 at 56 and GPT-5.5 at 55) and leads all models on PostTrainBench, a benchmark for agentic fine-tuning tasks. Key technical contributions include a modified sparse attention indexer applied every four layers (cutting per-token computation 2.9x at 1M context), a switch from GRPO to PPO for long-horizon RL training, and a reward-hacking mitigation pipeline using rule-based filters and a judge model. API pricing is substantially below comparable proprietary models, and the release coincides with U.S. government restrictions on access to Anthropic's frontier models.

Open Weights Progress · Inference Economics · Artificial Analysis Intelligence Index · AA-Briefcase · DeepSeek V4 · +14 more

5 · arXiv · cs.CL · Jun 23, 2026 · source ↗

Adaptive Data Scheduling (ADS) improves LLM reinforcement learning post-training by 5.2% over GRPO

Researchers propose Adaptive Data Scheduling (ADS), a dual-level framework that replaces uniform sampling in RL post-training with adaptive distribution over semantic clusters and policy-boundary sample selection. Evaluated across three LLMs and seven reasoning benchmarks, ADS improves average accuracy by 5.2% over GRPO and generalizes across RL objectives. The method addresses a structural limitation in standard RL post-training pipelines by accounting for semantic data structure and evolving policy capability during training.

Evaluation and Benchmarking · Alignment and RLHF · Adaptive Data Scheduling · GRPO (Group Relative Policy Optimization)

4 · arXiv · cs.CL · Jun 23, 2026 · source ↗

P4IR framework uses SFT + GRPO to improve LLM-based automated building code compliance

Researchers introduce P4IR, a two-stage framework combining supervised fine-tuning (SFT) and Group Relative Policy Optimization (GRPO) to improve LLM accuracy in automated code compliance (ACC) for building regulations. The approach reduces tree edit distance and token-level Levenshtein distance by up to 23.8% and 38.6% respectively versus SFT baselines, and outperforms Claude Opus/Sonnet 4.5, GPT-5.2, Qwen-3-Max, and GLM-4.7 in zero-shot settings. The work targets a narrow but practically important domain where LLM hallucinations carry real regulatory consequences.

Enterprise Deployment Patterns · Alignment and RLHF · GPT-5.2 · Claude Opus 4.6 · Claude Sonnet 4.5 · +4 more

5 · arXiv · cs.CL · Jun 15, 2026 · source ↗

CORA: Consistency-Oriented Reasoning Alignment addresses thinking-answer gap in multimodal RLVR

Researchers identify and analyze a systematic inconsistency between reasoning traces and final answers in RLVR-trained large vision-language models, showing the problem persists throughout GRPO training and inference. They propose CORA, which introduces a lightweight plug-and-play consistency reward model and a Hybrid Reward Advantage Splitting (HRAS) mechanism to coordinate task and consistency optimization. Experiments across multimodal reasoning benchmarks show CORA improves both task performance and reasoning faithfulness.

Evaluation and Benchmarking · Alignment and RLHF · CORA · Hybrid Reward Advantage Splitting · GRPO (Group Relative Policy Optimization) · +1 more

4 · arXiv · cs.CL · Jun 10, 2026 · source ↗

N-GRPO: Semantic Neighbor Mixing for Improved Policy Optimization in LLM Reasoning

A new arXiv preprint introduces N-GRPO, an exploration strategy for the GRPO reinforcement learning framework that improves solution diversity during rollout by mixing embeddings of anchor tokens with their nearest semantic neighbors rather than using token-level sampling or random noise. The method is evaluated on DeepSeek-R1-Distill-Qwen models of various sizes and shows consistent improvements on math reasoning benchmarks plus out-of-distribution generalization. The work targets a known limitation in RLHF-style training: redundant rollout trajectories that reduce effective learning signal.

Evaluation and Benchmarking · Alignment and RLHF · N-GRPO · DeepSeek-R1-Distill-Qwen · Semantic Neighbor Mixing · +1 more

7 · arXiv · cs.CL · Jun 10, 2026 · source ↗

One-shot GRPO training on a single biased example can break LLM alignment

A new arXiv paper demonstrates that a single biased training example using Group Relative Policy Optimization (GRPO) is sufficient to induce systematic bias in aligned LLMs, with stereotype-driven reasoning generalizing across attributes, categories, and benchmarks. The authors find that model susceptibility varies based on the initial likelihood of producing biased outputs. The result exposes a critical vulnerability in post-training alignment: a minimal fine-tuning intervention can override safety guardrails.

AI Safety Research · Alignment and RLHF · It Takes One to Bias Them All: Breaking Bad with One-Shot GRPO · GRPO (Group Relative Policy Optimization)

5 · arXiv · cs.LG · Jun 5, 2026 · source ↗

RREDCoT: Segment-level reward redistribution for chain-of-thought reasoning via self-approximated credit assignment

RREDCoT is a new method for redistributing rewards across segments of Chain-of-Thought traces during RL fine-tuning of reasoning language models, addressing the high-variance delayed-reward problem inherent in GRPO-style training. Rather than using computationally expensive Monte Carlo sampling for intermediate state value estimation, the method uses the model itself to approximate optimal reward redistribution without additional generation passes. The paper evaluates RREDCoT against MC sampling and several attribution baselines, analyzing segmentation strategies and state value estimation. This is relevant to the active research thread on improving RL fine-tuning stability and efficiency for reasoning models.

Alignment and RLHF · RREDCoT: Segment-Level Reward Redistribution for Reasoning Models · Chain-of-Thought Reasoning · GRPO (Group Relative Policy Optimization) · +1 more

7 · Qwen Research · May 18, 2026 · source ↗

GSPO: Group Sequence Policy Optimization for Scalable RL Training of Language Models

Qwen researchers introduce Group Sequence Policy Optimization (GSPO), a new RL algorithm designed to address severe training instability and model collapse observed in existing methods like GRPO during extended training runs. The core motivation is enabling stable RL scaling for language models to improve reasoning and problem-solving capabilities with increased compute. The paper targets a known bottleneck in post-training pipelines where instability prevents further performance gains.

Training Infrastructure · Frontier Model Releases · Qwen · GSPO (Group Sequence Policy Optimization) · GRPO (Group Relative Policy Optimization) · +2 more