What it is
Proximal Policy Optimization (PPO) is a policy-gradient reinforcement learning algorithm introduced by OpenAI in July 2017. Its core idea is simple: when updating a policy from collected experience, clip the ratio between the new and old action probabilities so that no single update moves the policy too far. This "proximal" constraint — enforced cheaply through a clipped surrogate loss rather than an expensive second-order trust-region calculation — lets practitioners run multiple gradient epochs over the same batch of experience without destabilizing training. OpenAI adopted PPO as its default RL algorithm on release, citing its balance of performance and ease of use.
How it works
The standard PPO objective clips the probability ratio $r_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_\text{old}}(a_t \mid s_t)$ to the interval $[1-\epsilon, 1+\epsilon]$, multiplies both the clipped and the unclipped ratio by the advantage estimate, and takes the minimum of the two terms to form a pessimistic bound on policy improvement. A separate value network (critic) is trained in parallel to estimate returns, providing a variance-reducing baseline for the policy gradient. In practice, PPO alternates between collecting rollouts under the current policy and running several epochs of minibatch gradient descent on the combined actor-critic loss, a pattern that makes it straightforward to parallelize across many environment workers.
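As a minimal sketch of the clipped surrogate described above, the following pure-Python function computes the per-sample objective from log-probabilities and advantages. The function name, the batch-as-lists layout, and the default `eps=0.2` are illustrative assumptions rather than anything fixed by the text; a real implementation would operate on tensors and negate the result to obtain a loss.

```python
import math

def clipped_surrogate(new_logp, old_logp, advantages, eps=0.2):
    """Mean clipped surrogate objective (to be maximized) over a batch.

    eps=0.2 is an assumed, commonly used default, not a value from the text.
    """
    total = 0.0
    for lp_new, lp_old, adv in zip(new_logp, old_logp, advantages):
        ratio = math.exp(lp_new - lp_old)  # r_t = pi_new / pi_old
        clipped = max(1.0 - eps, min(ratio, 1.0 + eps))
        # Pessimistic bound: keep the smaller of the two candidate terms.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)

# Ratio 1.5 with a positive advantage is capped at 1 + eps.
capped = clipped_surrogate([math.log(0.6)], [math.log(0.4)], [1.0])
# Ratio 1.5 with a negative advantage is left unclipped, because the
# minimum keeps the more pessimistic (lower) term.
uncapped = clipped_surrogate([math.log(0.6)], [math.log(0.4)], [-1.0])
```

The asymmetry in the two example calls is the point of the `min`: clipping only ever removes the incentive to push the ratio further in the direction that would improve the objective.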
`` Collect rollouts → Compute advantages → Clip ratio → Update actor + critic → Repeat ``
The KL divergence between the old and new policy is monitored (and sometimes penalized directly), because clipping only removes the incentive to push the ratio outside the interval; it does not strictly bound how far the policy moves.
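One way to monitor this, sketched below under stated assumptions, is to estimate the KL divergence from the log-probabilities already stored for the update and stop the epoch loop when it exceeds a threshold. The helper names and the `target_kl=0.015` value are hypothetical; the text does not specify an estimator or a threshold.

```python
import math

def approx_kl(old_logp, new_logp):
    """Sample estimate of KL(old || new), using actions drawn under the old policy."""
    diffs = [lo - ln for lo, ln in zip(old_logp, new_logp)]
    return sum(diffs) / len(diffs)

def should_stop_early(old_logp, new_logp, target_kl=0.015):
    # target_kl is an assumed threshold chosen for illustration only.
    return approx_kl(old_logp, new_logp) > target_kl

old = [math.log(0.5), math.log(0.5)]
new = [math.log(0.5), math.log(0.25)]
kl = approx_kl(old, new)        # mean of [0, log 2]
stop = should_stop_early(old, new)
```

Because this is a sample average, individual estimates can be noisy or even slightly negative on small batches, so the check is best treated as a diagnostic rather than a guarantee.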
Why it matters
PPO's significance extends well beyond its original game-playing context. Two trajectories define its impact:
Game-playing at scale. OpenAI Five — the system that defeated amateur Dota 2 teams in 2018 and world champions in 2019 — was trained using large-scale distributed PPO with self-play. Dota 2 is a long-horizon, high-dimensional, multi-agent environment; the fact that PPO scaled to superhuman performance there established it as viable for the hardest RL problems of its era.
RLHF for language models. The standard pipeline for aligning large language models with human preferences (pretrain, collect preference data, train a reward model, fine-tune with RL) conventionally uses PPO for the RL step. PPO's stability under a KL penalty (which keeps the fine-tuned model close to the base) and its value network (which reduces gradient variance over long token sequences) make it well-suited to the LLM fine-tuning regime. A 2022 Hugging Face overview presented this pipeline as the canonical RLHF recipe; a 2023 practitioner reference from the same source cataloged the low-level engineering details (reward normalization, KL penalty scheduling, value function initialization, batch construction) that papers routinely omit but that determine whether training actually converges.
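The KL penalty mentioned above is often applied by folding it into the per-token reward. The sketch below shows one such shaping, assuming a single scalar reward-model score credited to the final token of the response; the function name, the `beta=0.1` coefficient, and the example numbers are illustrative assumptions, not details from the cited sources.

```python
def kl_shaped_rewards(policy_logp, ref_logp, score, beta=0.1):
    """Per-token rewards for one sampled response.

    Each token is penalized by beta * (log pi - log pi_ref); the scalar
    reward-model score is added on the last token. beta is an assumed value.
    """
    rewards = [-beta * (lp - lr) for lp, lr in zip(policy_logp, ref_logp)]
    rewards[-1] += score
    return rewards

# Three-token response; only the middle token has drifted from the reference.
rewards = kl_shaped_rewards(
    policy_logp=[-1.0, -0.5, -2.0],
    ref_logp=[-1.0, -1.5, -2.0],
    score=1.0,
)
```

Tokens where the fine-tuned policy assigns more probability than the reference model incur a penalty, which is what pulls the policy back toward the base model.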
Variants and alternatives
The post-training landscape has diversified considerably since 2017, but PPO remains the reference point:
- GRPO (Group Relative Policy Optimization) eliminates the value network by normalizing rewards within a group of rollouts for the same prompt. This reduces memory and compute but replaces PPO's learned, per-state baseline with a single group-level one, which gives coarser credit assignment when rewards are sparse or delayed. Z.ai's GLM-5.2 (June 2026) switched from GRPO back to PPO for long-horizon agentic coding RL, explicitly citing stability advantages, a concrete data point on where GRPO's tradeoffs bite.
- DPO (Direct Preference Optimization) bypasses the RL loop entirely, framing preference alignment as a supervised objective. It is simpler to implement but requires a different data format and cannot easily incorporate process-level or outcome-based reward signals.
- ZPPO (Zone of Proximal Policy Optimization, June 2026) addresses a specific PPO failure mode: when a small model's rollouts all receive zero reward, no gradient signal flows. ZPPO embeds teacher guidance directly in prompts (Binary and Negative Candidate-included Questions) and uses a replay buffer for hard questions, outperforming both distillation and GRPO baselines on the Qwen3 family (0.8B–9B) across a 31-benchmark suite — with the largest gains at the smallest model scale.
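To make the group normalization in the GRPO item concrete, here is a minimal sketch that standardizes rewards within one prompt's group of rollouts. The function name, the use of population standard deviation, and the `eps` stabilizer are assumptions for illustration; published implementations differ in these details. The second call shows what happens in this group-normalized setting when every rollout scores zero, which is the kind of missing learning signal the ZPPO item describes.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards across rollouts sampled for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std: an assumed choice
    return [(r - mean) / (std + eps) for r in rewards]

# Mixed outcomes: successes get positive advantages, failures negative.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# Every rollout failed: all advantages are zero, so the policy-gradient
# term contributes no update for this prompt.
all_zero = group_relative_advantages([0.0, 0.0, 0.0, 0.0])
```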
PPO in robotics and agentic systems
Beyond LLMs, PPO remains the optimizer of choice in continuous-control settings. AgenticRL (June 2026) uses PPO inside a closed-loop framework where a multimodal GPT agent generates reward functions, trains policies, and iteratively refines them for UAV navigation — achieving a 71% improvement in policy behavior over initial rewards and 91% real-world success. CoorDex (June 2026), a pipeline for dexterous humanoid loco-manipulation, trains coordinated residual RL policies using PPO-style updates to compose latent priors for body and hand motion on a Unitree G1 robot.
Implementation tradeoffs and pitfalls
PPO's practical complexity is higher than its clean objective suggests. The Hugging Face practitioner reference identifies several critical knobs for RLHF stability:
- Reward normalization: unnormalized rewards cause value function divergence.
- KL penalty scheduling: too tight and the model doesn't learn; too loose and it reward-hacks.
- Value function initialization: initializing from the LM's final layer rather than randomly is often essential.
- Batch construction: mixing on-policy and off-policy data, or using stale rollouts, degrades performance in ways that are hard to diagnose.
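As an illustration of the first knob in the list above, the following sketch keeps running statistics of observed rewards and rescales new rewards by them. The class name and the choice of a running mean and standard deviation (Welford's method) are assumptions; the text does not say which normalization scheme the reference recommends.

```python
import math

class RunningRewardNormalizer:
    """Hypothetical helper: tracks reward mean/variance online (Welford's method)."""

    def __init__(self, eps=1e-8):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean
        self.eps = eps

    def update(self, reward):
        self.count += 1
        delta = reward - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (reward - self.mean)

    def normalize(self, reward):
        if self.count < 2:
            return reward - self.mean  # too few samples to estimate a scale
        std = math.sqrt(self.m2 / self.count)
        return (reward - self.mean) / (std + self.eps)

normalizer = RunningRewardNormalizer()
for r in [2.0, 4.0, 6.0]:
    normalizer.update(r)
centered = normalizer.normalize(4.0)  # equals the running mean
scaled = normalizer.normalize(6.0)    # about 1.22 standard deviations above it
```

Keeping the critic's regression targets on a roughly unit scale is the usual motivation for this kind of normalization, since large raw rewards translate directly into large value-loss gradients.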
Z.ai's GLM-5.2 also deployed a reward-hacking mitigation pipeline — rule-based filters plus a judge model — alongside PPO, underscoring that the algorithm alone does not prevent reward exploitation in long-horizon settings.
Where it's heading
PPO's longevity is a function of its stability properties, which matter most precisely where newer, lighter alternatives struggle: long-horizon tasks, sparse rewards, and settings where a value baseline is worth its memory cost. The 2026 evidence — GLM-5.2 reverting from GRPO, ZPPO patching its zero-reward failure mode, robotics pipelines continuing to rely on it — suggests PPO is not being displaced so much as specialized around. Expect it to remain the default for high-stakes, long-horizon RL while lighter methods (GRPO, DPO) handle the dense-reward, short-horizon alignment cases where their simplicity wins.




