GraphPO: Graph-based Policy Optimization for Reasoning Models
Recent events (1)
GraphPO: Graph-based Policy Optimization reduces redundancy in LLM reasoning RL
GraphPO is a new reinforcement learning framework that represents reasoning rollouts as directed acyclic graphs rather than independent chains or trees, merging semantically equivalent reasoning paths into equivalence classes to share suffixes and reduce redundant exploration. The approach assigns efficiency advantages to incoming edges and correctness advantages to outgoing edges, deriving process supervision from outcome rewards. Experiments on three LLMs across reasoning and agentic search benchmarks show consistent improvements over chain- and tree-based baselines under equal token or response budgets. The method also provides theoretical guarantees on reduced advantage-estimation variance.
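
The merging idea in the summary above can be illustrated with a minimal sketch. It is not GraphPO's actual algorithm: the summary does not say how semantic equivalence is detected or how the efficiency and correctness advantages are computed. Everything here is an assumption made for illustration. `canonical` is a hypothetical stand-in that treats steps as equivalent when they match after lowercasing and whitespace normalization, `build_graph` and `node_values` are hypothetical helper names, and the per-node value (the mean outcome reward of the rollouts passing through a node) is just one simple way to derive step-level signal from outcome rewards. The sketch also assumes no rollout repeats a step, so the merged graph stays acyclic.

```python
from collections import defaultdict

ROOT = "<root>"

def canonical(step: str) -> str:
    # Hypothetical equivalence test: steps are "the same" if they match
    # after lowercasing and collapsing whitespace. A real system would
    # need a semantic comparison; this is only a placeholder.
    return " ".join(step.lower().split())

def build_graph(rollouts):
    """Merge rollouts into one graph keyed by canonical step.

    rollouts: list of (steps, outcome_reward) pairs.
    Returns (edges, node_rewards), where edges maps (parent, child) to the
    number of rollouts using that edge and node_rewards maps each node to
    the outcome rewards of the rollouts that pass through it.
    """
    edges = defaultdict(int)
    node_rewards = defaultdict(list)
    for steps, reward in rollouts:
        prev = ROOT
        node_rewards[prev].append(reward)
        for step in steps:
            node = canonical(step)
            edges[(prev, node)] += 1
            node_rewards[node].append(reward)
            prev = node
    return dict(edges), dict(node_rewards)

def node_values(node_rewards):
    # Assumed estimator: mean outcome reward over rollouts through a node.
    return {node: sum(r) / len(r) for node, r in node_rewards.items()}

# Three toy rollouts; the first and third differ only in casing/spacing.
rollouts = [
    (["Let x = 3", "Then x + 2 = 5", "Answer: 5"], 1.0),
    (["let  x = 3", "Then x * 2 = 6", "Answer: 6"], 0.0),
    (["Let x = 3", "then x + 2 = 5", "Answer: 5"], 1.0),
]

edges, node_rewards = build_graph(rollouts)
values = node_values(node_rewards)

for (parent, child), count in edges.items():
    print(f"{parent!r} -> {child!r}: used by {count} rollout(s)")
```

In this toy run the nine steps of the three chains collapse into five distinct edges, because the shared opening step and the duplicated correct branch are each stored once. The node for the correct branch gets a value of 1.0 and the node for the incorrect branch gets 0.0, which is the kind of step-level signal the summary describes as being derived from outcome rewards.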