APPO: Agentic Procedural Policy Optimization

paper · active · provisional · appo-agentic-procedural-policy-optimization-202bf445 · 1 event · first seen 6d ago

Aliases: APPO: Agentic Procedural Policy Optimization, Agentic Procedural Policy Optimization

Recent events (1)

6 · arXiv · cs.LG · 6d ago

APPO: Fine-grained branching and credit assignment for agentic RL in LLMs

Researchers introduce Agentic Procedural Policy Optimization (APPO), a reinforcement learning method that moves branching and credit assignment from coarse tool-call boundaries to fine-grained decision points within generated sequences. To select exploration points, APPO uses a Branching Score that combines token uncertainty with policy-induced likelihood gains; to distribute credit, it applies procedure-level advantage scaling. Evaluated on 13 benchmarks, APPO improves strong agentic RL baselines by nearly 4 points while maintaining efficient tool use and interpretability. The work addresses a known weakness in multi-turn agentic RL: influential decisions are distributed throughout a sequence rather than concentrated at tool-call boundaries.
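
To make the two ideas in the summary concrete, here is a minimal Python sketch. It is not the paper's implementation: the summary gives no formulas, so the way entropy and likelihood gain are combined (a weighted sum with a hypothetical `alpha`), the use of a reference policy to measure the gain, and the per-segment weights in `scale_advantage` are all assumptions made for illustration.

```python
import math
from typing import Dict, List, Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def branching_scores(
    dists: Sequence[Sequence[float]],
    logp_policy: Sequence[float],
    logp_ref: Sequence[float],
    alpha: float = 0.5,  # hypothetical mixing weight, not from the source
) -> List[float]:
    """Assumed form of a branching score for each token position.

    Combines token uncertainty (entropy of the next-token distribution)
    with a likelihood gain, taken here as how much more log-probability
    the current policy assigns to the sampled token than a reference policy.
    """
    scores = []
    for dist, lp, lr in zip(dists, logp_policy, logp_ref):
        gain = max(0.0, lp - lr)  # only count positions the policy moved toward
        scores.append(alpha * token_entropy(dist) + (1.0 - alpha) * gain)
    return scores


def top_k_branch_points(scores: Sequence[float], k: int) -> List[int]:
    """Return the k highest-scoring positions, in sequence order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def scale_advantage(
    advantage: float,
    segment_ids: Sequence[int],
    segment_weights: Dict[int, float],
) -> List[float]:
    """Assumed form of procedure-level scaling.

    Each token inherits the trajectory-level advantage multiplied by the
    weight of the procedure (segment) it belongs to.
    """
    return [advantage * segment_weights[s] for s in segment_ids]


# Toy sequence of three positions with two-token vocabularies.
dists = [[1.0, 0.0], [0.5, 0.5], [0.9, 0.1]]
logp_policy = [0.0, math.log(0.5), math.log(0.9)]
logp_ref = [0.0, math.log(0.25), math.log(0.9)]

scores = branching_scores(dists, logp_policy, logp_ref)
branch_at = top_k_branch_points(scores, k=1)
token_advantages = scale_advantage(2.0, [0, 0, 1], {0: 0.5, 1: 1.5})
```

In this toy run the middle position is selected for branching because it is both the most uncertain and the one where the policy has gained the most likelihood over the reference. Under these assumptions, branch points can fall anywhere in the sequence, not only at tool-call boundaries, which is the shift the summary describes.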