technique

GraphPO

technique · active · provisional · graphpo-edc83838 · 1 event · first seen 2d ago

Aliases: GraphPO

Co-occurring entities

GraphPO: Graph-based Policy Optimization for Reasoning Models

More like this (12)

GRPO · N-GRPO · LangGraph · Latent-Anchored GRPO · AdvGRPO · GRPO (Group Relative Policy Optimization) · GraphPO: Graph-based Policy Optimization for Reasoning Models · GSPO (Group Sequence Policy Optimization) · GraphReview · Graph RAG · CodeGraph · SimPO

Recent events (1)

6 · arXiv · cs.CL · 2d ago

GraphPO: Graph-based Policy Optimization reduces redundancy in LLM reasoning RL

GraphPO is a new reinforcement learning framework that represents reasoning rollouts as directed acyclic graphs rather than independent chains or trees, merging semantically equivalent reasoning paths into equivalence classes to share suffixes and reduce redundant exploration. The approach assigns efficiency advantages to incoming edges and correctness advantages to outgoing edges, deriving process supervision from outcome rewards. Experiments on three LLMs across reasoning and agentic search benchmarks show consistent improvements over chain- and tree-based baselines under equal token or response budgets. The method also provides theoretical guarantees on reduced advantage-estimation variance.
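
The merging idea in the summary above can be illustrated with a toy sketch. This is not the paper's algorithm: the `canonical` equivalence key (lowercase, whitespace removed), the `(depth, key)` node identity, the mean-reward node value, and the "child value minus parent value" edge score are all assumptions chosen for illustration. The score is only a simple stand-in for an outcome-derived signal on outgoing edges. The efficiency advantage on incoming edges is left out because the summary does not give its form.

```python
from collections import defaultdict

ROOT = (-1, "<root>")  # hypothetical shared start node for every rollout


def canonical(step):
    """Assumed equivalence key: lowercase the step and drop all whitespace."""
    return "".join(step.lower().split())


def build_dag(rollouts):
    """Merge rollouts into a DAG keyed by (depth, canonical step).

    Steps that are equivalent at the same depth share one node, so every
    rollout passing through that node contributes to the same statistics.
    """
    rewards = defaultdict(list)  # node -> outcome rewards of rollouts through it
    edges = set()
    for steps, reward in rollouts:
        prev = ROOT
        rewards[ROOT].append(reward)
        for depth, step in enumerate(steps):
            node = (depth, canonical(step))
            rewards[node].append(reward)
            edges.add((prev, node))
            prev = node
    return rewards, edges


def edge_advantages(rewards, edges):
    """Assumed edge score: mean outcome reward of child minus that of parent."""
    value = {node: sum(r) / len(r) for node, r in rewards.items()}
    return {(u, v): value[v] - value[u] for u, v in edges}


def trie_size(rollouts):
    """Node count if rollouts were merged only on exact shared prefixes (a tree)."""
    prefixes = set()
    for steps, _ in rollouts:
        keys = tuple(canonical(s) for s in steps)
        for i in range(1, len(keys) + 1):
            prefixes.add(keys[:i])
    return len(prefixes)


# Four toy rollouts: (reasoning steps, outcome reward).
rollouts = [
    (["Let x = 2+3", "x = 5", "2x = 10"], 1.0),
    (["let x = 2 + 3", "x = 5", "2x = 12"], 0.0),
    (["Count: 2, 3", "x = 5", "2x = 10"], 1.0),
    (["Count: 2, 3", "x = 6", "2x = 12"], 0.0),
]

rewards, edges = build_dag(rollouts)
adv = edge_advantages(rewards, edges)

dag_nodes = len(rewards) - 1  # exclude the root
tree_nodes = trie_size(rollouts)
print(f"tree nodes: {tree_nodes}, DAG nodes: {dag_nodes}")
for (u, v), a in sorted(adv.items()):
    print(f"{u[1]} -> {v[1]}: {a:+.3f}")
```

In this toy example the node for `x = 5` is reached from two different first steps, so its value is estimated from three rollouts instead of being split across separate branches. That pooling is the intuition behind sharing suffixes in the summary. The actual equivalence test and advantage formulas are defined in the paper.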

Frontier Model Releases · Alignment and RLHF · GraphPO · GraphPO: Graph-based Policy Optimization for Reasoning Models