APPO: Agentic Procedural Policy Optimization
1 event · first seen 6d ago
Aliases: APPO: Agentic Procedural Policy Optimization, Agentic Procedural Policy Optimization
Recent events (1)
APPO: Fine-grained branching and credit assignment for agentic RL in LLMs
Researchers introduce Agentic Procedural Policy Optimization (APPO), a reinforcement learning method that moves branching and credit assignment from coarse tool-call boundaries to fine-grained decision points within generated sequences. To select exploration points, APPO uses a Branching Score that combines token uncertainty with policy-induced likelihood gains; to distribute credit, it applies procedure-level advantage scaling. Evaluated on 13 benchmarks, APPO improves on strong agentic RL baselines by nearly 4 points while maintaining efficient tool use and interpretability. The work addresses a known weakness in multi-turn agentic RL: influential decisions are distributed throughout sequences, not concentrated at tool-call boundaries.
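The summary does not give the Branching Score formula, so the following is only a minimal sketch of how a score of this kind could rank positions in a generated sequence. Everything specific here is an assumption made for illustration: the function names, the `alpha` weight, the use of normalized Shannon entropy as "token uncertainty", and the reading of "policy-induced likelihood gain" as the increase in a token's log-probability under the current policy relative to a reference policy.

```python
import math
from typing import List, Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def branching_scores(
    dists: Sequence[Sequence[float]],
    logp_policy: Sequence[float],
    logp_reference: Sequence[float],
    alpha: float = 0.5,
) -> List[float]:
    """Hypothetical per-position score (not the formula from the APPO work).

    score = alpha * normalized_entropy + (1 - alpha) * max(0, likelihood_gain)
    """
    scores = []
    for probs, lp_new, lp_ref in zip(dists, logp_policy, logp_reference):
        # Normalize entropy by its maximum, log(vocabulary size), so it lies in [0, 1].
        max_entropy = math.log(len(probs)) if len(probs) > 1 else 1.0
        uncertainty = token_entropy(probs) / max_entropy
        # Assumed reading of "likelihood gain": how much more likely the sampled
        # token is under the current policy than under the reference policy.
        gain = max(0.0, lp_new - lp_ref)
        scores.append(alpha * uncertainty + (1.0 - alpha) * gain)
    return scores


def select_branch_points(scores: Sequence[float], k: int) -> List[int]:
    """Return the k highest-scoring positions, in sequence order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Toy input: four positions over a four-token vocabulary.
dists = [
    [0.97, 0.01, 0.01, 0.01],  # confident prediction
    [0.25, 0.25, 0.25, 0.25],  # maximally uncertain
    [0.50, 0.50, 0.00, 0.00],  # two-way split
    [0.70, 0.10, 0.10, 0.10],  # moderately confident
]
logp_policy = [-0.03, -1.2, -0.6, -0.4]
logp_reference = [-0.03, -1.4, -1.5, -0.4]

scores = branching_scores(dists, logp_policy, logp_reference)
print(select_branch_points(scores, k=2))  # positions 1 and 2 for this toy input
```

In this toy input, position 1 ranks highly because of uncertainty alone, while position 2 ranks highest because moderate uncertainty coincides with a large likelihood gain. That is the kind of mid-sequence decision point that branching only at tool-call boundaries would miss.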