6 · arXiv cs.CL (Computation and Language) · 1mo ago

Implicit Hierarchical GRPO: Decoupling Tool Invocation from Execution for Tool-Integrated Mathematical Reasoning

This paper introduces IH-GRPO, a reinforcement learning algorithm that decouples tool invocation from immediate execution during LLM reasoning, addressing the coherence disruption caused by tight coupling in existing tool-integrated reasoning (TIR) approaches. The authors propose a hierarchical control framework and derive a surrogate loss enabling an implicitly hierarchical policy to match the behavior of an explicit hierarchical policy. Experiments on Qwen3 models (1.7B, 4B, 8B) show absolute improvements of 1.87–2.53% across six out-of-domain mathematical reasoning benchmarks over the strongest baseline. Code is publicly released.
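As background for the GRPO family that IH-GRPO extends, here is a minimal sketch of the standard group-relative advantage, in which each rollout's reward is normalized against the other rollouts sampled for the same prompt. This is plain GRPO, not the IH-GRPO surrogate loss, which the summary does not specify. The function name and the `eps` value are illustrative assumptions.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages for the rollouts sampled from one prompt.

    Each reward is centered on the group mean and scaled by the group's
    standard deviation. Population std is used here; implementations differ
    on population vs. sample std, so treat that choice as an assumption.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division finite when every rollout gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four rollouts for one prompt with binary correctness rewards.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Correct rollouts receive positive advantages and incorrect ones negative. A group in which all rollouts score the same yields all-zero advantages and therefore no learning signal.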

Evaluation and Benchmarking · Agent and Tool Ecosystem · Alignment and RLHF · GRPO · Tool-Integrated Reasoning · Qwen3-4B · IH-GRPO · Qwen3-1.7B

Related guides (4)

GRPO · Concept

GRPO: The Lightweight RL Trick Behind Today's Reasoning Models

Agent and Tool Ecosystem · Topic guide

Agent and Tool Ecosystem: How the Infrastructure Layer Around LLMs Is Consolidating

Alignment and RLHF · Topic guide

Alignment and RLHF: Teaching AI Models to Behave

Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: The Shifting Yardstick of AI Capability

Related events (8)

6 · arXiv · cs.CL · 2d ago

GraphPO: Graph-based Policy Optimization reduces redundancy in LLM reasoning RL

GraphPO is a new reinforcement learning framework that represents reasoning rollouts as directed acyclic graphs rather than independent chains or trees, merging semantically equivalent reasoning paths into equivalence classes to share suffixes and reduce redundant exploration. The approach assigns efficiency advantages to incoming edges and correctness advantages to outgoing edges, deriving process supervision from outcome rewards. Experiments on three LLMs across reasoning and agentic search benchmarks show consistent improvements over chain- and tree-based baselines under equal token or response budgets. The method also provides theoretical guarantees on reduced advantage-estimation variance.

Frontier Model Releases · Alignment and RLHF · GraphPO · GraphPO: Graph-based Policy Optimization for Reasoning Models

4 · arXiv · cs.CL · 11d ago

N-GRPO: Semantic Neighbor Mixing for Improved Policy Optimization in LLM Reasoning

A new arXiv preprint introduces N-GRPO, an exploration strategy for the GRPO reinforcement learning framework that improves solution diversity during rollout by mixing embeddings of anchor tokens with their nearest semantic neighbors rather than using token-level sampling or random noise. The method is evaluated on DeepSeek-R1-Distill-Qwen models of various sizes and shows consistent improvements on math reasoning benchmarks plus out-of-distribution generalization. The work targets a known limitation in RLHF-style training: redundant rollout trajectories that reduce effective learning signal.
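The neighbor-mixing idea in the summary above can be illustrated with a toy sketch: find the embeddings most similar to an anchor token's embedding and blend them into it. The cosine metric, the neighbor count `k`, the mixing weight `alpha`, and the plain averaging are all assumptions made for illustration; the summary does not state how N-GRPO selects or weights neighbors.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def mix_with_neighbors(anchor_id, embeddings, k=2, alpha=0.3):
    """Blend an anchor embedding with the mean of its k nearest neighbors.

    Hypothetical helper: `k` and `alpha` are illustrative knobs, not values
    from the paper. Assumes the table holds at least one other embedding.
    """
    anchor = embeddings[anchor_id]
    others = [i for i in range(len(embeddings)) if i != anchor_id]
    # Rank every other token by similarity to the anchor, most similar first.
    nearest = sorted(others, key=lambda i: cosine(anchor, embeddings[i]),
                     reverse=True)[:k]
    neighbor_mean = [
        sum(embeddings[i][d] for i in nearest) / len(nearest)
        for d in range(len(anchor))
    ]
    # Convex combination: alpha=0 returns the anchor unchanged.
    return [(1 - alpha) * a + alpha * m for a, m in zip(anchor, neighbor_mean)]

# Toy 2-D embedding table with three tokens.
table = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
mixed = mix_with_neighbors(0, table, k=1, alpha=0.5)
```

Unlike adding random noise, the perturbation here stays in the direction of semantically close tokens, which is the property the summary attributes to the method.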

Evaluation and Benchmarking · Alignment and RLHF · N-GRPO · DeepSeek-R1-Distill-Qwen · Semantic Neighbor Mixing

7 · arXiv · cs.CL · 23d ago

AXPO: Agent Explorative Policy Optimization Addresses Thinking-Acting Gap in Multimodal Agentic Reasoning

This paper identifies a structural asymmetry in agentic reasoning called the 'Thinking-Acting Gap,' where tool use is attempted in only ~30% of rollouts under standard RL training (GRPO), and all-wrong tool-using subgroups suppress learning signals. The authors propose AXPO (Agent eXplorative Policy Optimization), which fixes the thinking prefix and resamples tool calls for all-wrong subgroups, combined with uncertainty-based prefix selection. Evaluated across nine multimodal benchmarks on Qwen3-VL-Thinking at multiple scales, SFT+AXPO outperforms SFT+GRPO by +1.8pp on both Pass@1 and Pass@4 at 8B, with the 8B model surpassing the 32B baseline on Pass@4 using 4× fewer parameters.

Frontier Model Releases · Agent and Tool Ecosystem · AXPO · GRPO · Thinking-Acting Gap

4 · Hugging Face Blog · 1mo ago

Liger GRPO meets TRL: Efficient Reinforcement Learning Training Integration

The Hugging Face blog post announces the integration of Liger Kernel's GRPO (Group Relative Policy Optimization) implementation with TRL (Transformer Reinforcement Learning library). This combination aims to improve memory efficiency and training throughput for RL-based fine-tuning of language models. The integration targets practitioners running GRPO-style training on constrained hardware budgets.

Inference Economics · Agent and Tool Ecosystem · Liger Kernel · GRPO · Hugging Face

6 · The Batch · 34h ago

POPE Training Method Uses Partial Solution Hints to Improve RL Exploration in LLMs

Researchers from Carnegie Mellon University introduced Privileged On-Policy Exploration (POPE), a training method that pairs GRPO reinforcement learning with hint-augmented datasets to help LLMs solve hard problems they would otherwise fail to explore. During training, the model receives partial solution prefixes alongside full problems, enabling it to discover complete solutions; it is then trained on both hinted and unhinted versions so it learns to solve problems without hints at inference time. On competition math benchmarks AIME 2025 and HMMT 2025, POPE outperforms standard GRPO and supervised fine-tuning, with HMMT pass@1 improving from 31.0% to 37.8%. The method addresses a core bottleneck in RL training—sparse reward exploration—by decomposing hard problem-solving into finding a good starting state and completing the solution.
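The hinted/unhinted pairing described above might be assembled along these lines. This is a sketch under stated assumptions: the prompt templates, the `hint_fraction` knob, and splitting the reference solution on newlines are hypothetical choices, not details taken from the POPE paper.

```python
def build_training_prompts(problem, reference_solution, hint_fraction=0.25):
    """Return an unhinted and a hinted prompt for one hard problem.

    The hint is a prefix of a reference solution. `hint_fraction` is a
    hypothetical knob for how much of the solution is revealed.
    """
    steps = reference_solution.split("\n")
    # Always reveal at least one step, never the whole solution by default.
    n_hint = max(1, int(len(steps) * hint_fraction))
    hint = "\n".join(steps[:n_hint])

    unhinted = f"Problem: {problem}\nSolution:"
    hinted = (
        f"Problem: {problem}\n"
        f"Partial solution:\n{hint}\n"
        "Continue the solution:"
    )
    # Both versions go into the RL training mix, so the policy also learns
    # to solve the problem when no hint is available.
    return [
        {"prompt": unhinted, "hinted": False},
        {"prompt": hinted, "hinted": True},
    ]

solution = "Step 1: expand.\nStep 2: simplify.\nStep 3: solve.\nAnswer: 4"
prompts = build_training_prompts("Solve for x.", solution)
```

The hinted prompt places the policy in a state from which a correct completion is more likely to be sampled, which is how the summary frames the split between finding a good starting state and completing the solution.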

Evaluation and Benchmarking · Alignment and RLHF · Virginia Smith · Carnegie Mellon University · Aviral Kumar

5 · arXiv · cs.CL · 1mo ago

LamPO: Lambda-Style Policy Optimization with Pairwise Decomposed Advantage for Reasoning LMs

LamPO proposes a new RLVR training objective that replaces GRPO's scalar group-relative advantages with a Pairwise Decomposed Advantage, aggregating pairwise reward gaps within response groups and weighting comparisons by confidence-aware log-probability differences. The method retains a critic-free, clipped-update PPO-style structure and optionally adds a ROUGE-L-based dense auxiliary reward to reduce sparsity. Experiments on AIME24, AIME25, MATH-500, and GPQA-Diamond using Qwen3-1.7B, Qwen3-4B, and Phi-4-mini show consistent improvements over GRPO and other RLVR variants with more stable training dynamics.
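To make the contrast with GRPO's scalar advantage concrete, here is a rough sketch of an advantage built from pairwise reward gaps within a response group. The optional `weights` matrix is a stand-in for the confidence-aware weighting, whose exact form the summary does not give, and the averaging over the other responses is likewise an assumption.

```python
def pairwise_advantages(rewards, weights=None):
    """Advantage of each response as a weighted mean of its reward gaps.

    weights[i][j] scales the comparison between responses i and j. It is a
    hypothetical placeholder for LamPO's confidence-aware weights; with
    weights=None every comparison counts equally.
    """
    n = len(rewards)
    advantages = []
    for i in range(n):
        total = 0.0
        for j in range(n):
            if i == j:
                continue  # a response is not compared with itself
            w = 1.0 if weights is None else weights[i][j]
            total += w * (rewards[i] - rewards[j])
        advantages.append(total / (n - 1))
    return advantages

rewards = [1.0, 0.0, 0.0, 1.0]
advantages = pairwise_advantages(rewards)
```

With uniform weights the result equals the mean-centered reward scaled by n / (n - 1), so any departure from a plain mean baseline comes entirely from the per-pair weights.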

Frontier Model Releases · Evaluation and Benchmarking · RLVR · ROUGE-L · AIME24

6 · arXiv · cs.CL · 11d ago

Gravity-Weighted DPO enforces multi-level instruction hierarchies in LLMs

Researchers introduce Gravity-Weighted DPO (GW-DPO), a preference-optimization objective that scales per-sample loss offsets by the structural distance between conflicting instruction levels, addressing the problem of uniform architectural privilege across trust levels in production LLMs. The work formalizes a 5-level instruction hierarchy with ten pairwise priority relations and combines GW-DPO with hierarchy-specific delimiter tokens and Instructional Segment Embeddings (ISE). Evaluated on Llama-3.1-8B-Instruct, the bilateral GW-DPO schedule Pareto-improves over standard DPO on macro pairwise priority adherence while cutting over-refusal rates in half. The approach directly targets prompt injection vulnerabilities arising from models' inability to resolve competing instructions by privilege level.
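A sketch of how a distance-scaled offset could enter a DPO-style loss, assuming integer hierarchy levels 0 to 4 and an offset that grows linearly with the level gap. The function name, `beta`, `gamma`, and the linear form are illustrative assumptions. The summary does not give the exact GW-DPO objective or its bilateral schedule.

```python
import math
from itertools import combinations

NUM_LEVELS = 5
# Every unordered pair of distinct levels: C(5, 2) = 10 priority relations.
PRIORITY_PAIRS = list(combinations(range(NUM_LEVELS), 2))

def offset_dpo_loss(policy_margin, ref_margin, level_hi, level_lo,
                    beta=0.1, gamma=0.5):
    """DPO loss with a margin offset scaled by hierarchy distance.

    policy_margin: log pi(chosen) - log pi(rejected) under the policy.
    ref_margin:    the same difference under the frozen reference model.
    gamma:         hypothetical scale for the distance-based offset.
    """
    distance = abs(level_hi - level_lo)
    logit = beta * (policy_margin - ref_margin) - gamma * distance
    # -log(sigmoid(logit)) written as log(1 + exp(-logit)).
    return math.log1p(math.exp(-logit))

near = offset_dpo_loss(2.0, 0.0, level_hi=0, level_lo=1)
far = offset_dpo_loss(2.0, 0.0, level_hi=0, level_lo=4)
```

For the same preference margin, the conflict between distant levels incurs the larger loss, so the policy is pushed to separate chosen and rejected responses more strongly when the privilege gap is wide.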

AI Safety Research · Agent and Tool Ecosystem · Instructional Segment Embeddings · Training LLMs to Enforce Multi-Level Instruction Hierarchies via Gravity-Weighted Direct Preference Optimization · Llama3-8B-Instruct

7 · Qwen Research · 1mo ago

GSPO: Group Sequence Policy Optimization for Scalable RL Training of Language Models

Qwen researchers introduce Group Sequence Policy Optimization (GSPO), a new RL algorithm designed to address severe training instability and model collapse observed in existing methods like GRPO during extended training runs. The core motivation is enabling stable RL scaling for language models to improve reasoning and problem-solving capabilities with increased compute. The paper targets a known bottleneck in post-training pipelines where instability prevents further performance gains.

Training Infrastructure · Frontier Model Releases · Qwen · GSPO (Group Sequence Policy Optimization) · GRPO (Group Relative Policy Optimization)