What it is
Proximal Policy Optimization (PPO) is a policy-gradient reinforcement learning algorithm introduced by OpenAI in July 2017. Its core idea is a clipped surrogate objective: rather than allowing an unconstrained policy update — which can destabilize training by moving too far from the current policy — PPO clips the importance ratio between the new and old policy to keep updates within a trust region. This achieves the stability benefits of earlier trust-region methods (like TRPO) without their second-order optimization overhead, making PPO both performant and practical.
OpenAI adopted it as their default RL algorithm on release, citing its balance of ease of implementation, ease of tuning, and strong empirical performance across a wide range of tasks.
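The clipped surrogate objective described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the function name `clipped_surrogate`, its argument names, and the default `clip_eps=0.2` are assumptions chosen for the example, and real implementations compute the same quantity on autograd tensors.

```python
import numpy as np

def clipped_surrogate(logp_new, logp_old, advantages, clip_eps=0.2):
    """Mean PPO clipped surrogate objective (a quantity to maximize).

    logp_new / logp_old: log-probabilities of the taken actions under the
    new and old policies. clip_eps is an illustrative default.
    """
    logp_new = np.asarray(logp_new, dtype=float)
    logp_old = np.asarray(logp_old, dtype=float)
    advantages = np.asarray(advantages, dtype=float)

    # Importance ratio r = pi_new(a|s) / pi_old(a|s), computed in log space.
    ratio = np.exp(logp_new - logp_old)

    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages

    # Taking the elementwise minimum makes the objective a pessimistic bound:
    # the policy gains nothing from pushing the ratio past the clip range.
    return float(np.minimum(unclipped, clipped).mean())
```

Taking the minimum of the clipped and unclipped terms is what removes the incentive to move the ratio beyond `1 ± clip_eps` in the direction the advantage favors.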
How it works
The training loop has three phases that repeat:
1. Rollout: the current policy generates trajectories (sequences of observations, actions, and rewards).
2. Advantage estimation: a separate value network (the critic) predicts the expected return from each state; comparing observed returns against those predictions yields an advantage signal, a measure of how much better or worse each action was than expected.
3. Policy update: the policy (the actor) is updated to increase the probability of high-advantage actions, but only within the clipped trust region. The clip keeps the importance ratio r(θ) = π_new(a|s) / π_old(a|s) from straying too far from 1 by discarding gradient signal outside the boundary.
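The advantage-estimation step can be sketched as follows. The text above does not name a specific estimator; this example assumes Generalized Advantage Estimation (GAE), which is commonly paired with PPO. The function name, the `gamma` and `lam` defaults, and the single-trajectory setup (no episode boundary inside the arrays) are assumptions made for illustration.

```python
import numpy as np

def gae_advantages(rewards, values, last_value=0.0, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one trajectory (assumed estimator).

    rewards[t]: reward received after the action at step t.
    values[t]:  critic's value estimate for the state at step t.
    last_value: critic's estimate for the state after the final step
                (0.0 if the trajectory ended in a terminal state).
    """
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)

    # One-step TD errors: how much better the outcome was than predicted.
    next_values = np.append(values[1:], last_value)
    deltas = rewards + gamma * next_values - values

    # Discounted backward sum of TD errors.
    advantages = np.zeros_like(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = deltas[t] + gamma * lam * running
        advantages[t] = running

    # Regression targets for the critic.
    returns = advantages + values
    return advantages, returns
```

The `returns` array doubles as the target the critic is regressed toward, so both networks are trained from the same rollout.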
The critic is trained in parallel to minimize the value prediction error. This actor-critic structure is PPO's most significant engineering cost: when the critic is a separate network comparable in size to the actor, it roughly doubles the model footprint, and it adds a second optimization target.
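As a rough sketch of the second optimization target, the critic can be fit with a mean-squared-error loss against return targets and, if both networks share an optimizer, folded into one scalar loss. The names `critic_loss` and `combined_loss` and the weighting `vf_coef=0.5` are hypothetical choices for this example, not values taken from the text.

```python
import numpy as np

def critic_loss(value_preds, return_targets):
    """Mean squared value-prediction error (one common choice of critic loss)."""
    value_preds = np.asarray(value_preds, dtype=float)
    return_targets = np.asarray(return_targets, dtype=float)
    return float(np.mean((value_preds - return_targets) ** 2))

def combined_loss(policy_objective, value_loss, vf_coef=0.5):
    """Single scalar to minimize: negate the policy objective (which is
    maximized) and add the weighted critic loss. vf_coef is illustrative."""
    return -policy_objective + vf_coef * value_loss
```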
Why it matters
PPO became the load-bearing algorithm of the RLHF (Reinforcement Learning from Human Feedback) era. The canonical alignment pipeline — supervised fine-tuning → reward model training → PPO optimization against the reward model — was documented and popularized through practical guides like Hugging Face's StackLLaMA tutorial, which walked through the full workflow on Meta's LLaMA using the TRL library. PPO's role here is to update the language model's policy to maximize reward-model scores while a KL penalty prevents it from drifting too far from the supervised baseline.
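One common way to combine the reward-model score with the KL penalty is to shape the per-token reward, as in the sketch below. This is an assumption about implementation, not something the text specifies: the function name `kl_shaped_rewards`, the coefficient `beta=0.1`, and the choice to add the sequence-level score at the final token are illustrative.

```python
import numpy as np

def kl_shaped_rewards(rm_score, logp_policy, logp_ref, beta=0.1):
    """Per-token rewards for one generated response (illustrative sketch).

    rm_score:    scalar score from the reward model for the whole response.
    logp_policy: per-token log-probs of the sampled tokens under the policy.
    logp_ref:    per-token log-probs of the same tokens under the frozen
                 supervised (reference) model.
    """
    logp_policy = np.asarray(logp_policy, dtype=float)
    logp_ref = np.asarray(logp_ref, dtype=float)

    # Per-token log-ratio, a sample-based estimate of the KL divergence.
    kl_estimate = logp_policy - logp_ref

    # Every token is penalized for drifting from the reference model...
    rewards = -beta * kl_estimate
    # ...and the reward-model score is credited at the last token.
    rewards[-1] += rm_score
    return rewards
```

A larger `beta` keeps the policy closer to the supervised baseline at the expense of reward-model score.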
Beyond language models, PPO demonstrated its range in game-playing: an imitation-seeded curriculum using PPO achieved a score of 74,500 on Montezuma's Revenge — a notoriously hard-exploration Atari game — from a single human demonstration, surpassing all prior published results.
Variants and alternatives
The post-training landscape has fragmented significantly since PPO's RLHF debut, with every major alternative defining itself in relation to PPO's known costs and failure modes:
- GRPO eliminates the critic entirely by computing group-relative advantages: several responses are sampled for the same prompt, and each response's reward is normalized against the rest of its group. It is now a co-equal method in TRL v1.0 alongside PPO and DPO, and has become the default for many LLM reasoning fine-tunes.
- DPO sidesteps the RL loop altogether, framing preference learning as a supervised classification problem over pairs of preferred and rejected responses. It requires no separate reward model and no online rollouts during training, at the cost of learning only from a fixed, offline preference dataset.
- RLOO (REINFORCE Leave-One-Out) revisits pure REINFORCE with a leave-one-out baseline, offering a simpler RL alternative to PPO without a critic.
- DRPO targets a specific PPO pathology: the hard clip discards gradient signal at trust-region boundaries rather than correcting it. DRPO replaces the hard mask with a smooth advantage-weighted quadratic regularizer on policy shift, improving stability and efficiency in LLM post-training across model scales and precision settings.
- LamPO retains PPO's clipped-update structure but replaces GRPO's scalar group-relative advantages with a Pairwise Decomposed Advantage — aggregating pairwise reward gaps within response groups — showing consistent improvements over GRPO on math and reasoning benchmarks with more stable training dynamics.
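The critic-free ideas behind GRPO, RLOO, and DPO in the list above are compact enough to sketch directly. The function names, the `eps` stabilizer, and the `beta` default are hypothetical choices for this example; DRPO and LamPO are omitted because the text does not give their formulas.

```python
import numpy as np

def grpo_advantages(group_rewards, eps=1e-8):
    """GRPO-style advantages: standardize rewards within one prompt's group."""
    r = np.asarray(group_rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

def rloo_advantages(group_rewards):
    """RLOO advantages: each sample's baseline is the mean reward of the
    other samples in the group (requires at least two samples)."""
    r = np.asarray(group_rewards, dtype=float)
    baseline = (r.sum() - r) / (len(r) - 1)
    return r - baseline

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected,
             beta=0.1):
    """DPO loss for one preference pair: -log sigmoid(beta * margin), where
    the margin compares policy-vs-reference log-ratios of the two responses."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # log(1 + exp(-margin)) equals -log(sigmoid(margin)), computed stably.
    return float(np.logaddexp(0.0, -margin))
```

In both group-based estimators the baseline comes from sibling samples rather than a learned value network, which is where the memory saving over PPO comes from.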
Tooling and infrastructure
Hugging Face's TRL library is the dominant open-source implementation surface for PPO in the LLM context. TRL v1.0, released in March 2026, stabilized the API with PPO, DPO, and GRPO as first-class methods. A notable infrastructure improvement landed earlier: co-located vLLM inference within TRL places the generation and training processes on the same GPUs simultaneously, eliminating the idle-GPU problem inherent to online PPO pipelines where generation and gradient-update phases previously required alternating dedicated allocations.
Tradeoffs and when to use it
Use PPO when: you have online access to a reward signal, sufficient GPU memory for the actor-critic pair, and want a well-understood, production-tested baseline with a large body of implementation guidance.
Consider alternatives when: memory is constrained (GRPO, RLOO), your preference data is offline (DPO), you need more stable trust-region behavior at scale (DRPO), or you want richer advantage signals for reasoning tasks (LamPO).
PPO's hard trust-region masking is a known weakness in LLM settings specifically: importance ratios are a poor proxy for distributional shift in long-tailed vocabularies, and the hard clip discards rather than corrects gradient signal at boundaries — the problem DRPO was designed to address.
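To make the "discards rather than corrects" behavior concrete, the sketch below flags which samples contribute zero gradient under the standard clipped objective. The helper name `clip_zeroes_gradient` and the `clip_eps=0.2` default are assumptions for illustration.

```python
import numpy as np

def clip_zeroes_gradient(ratio, advantages, clip_eps=0.2):
    """Boolean mask of samples whose gradient the hard clip discards.

    With objective min(r * A, clip(r, 1 - eps, 1 + eps) * A), the clipped
    (constant) branch is selected only when the ratio has already moved past
    the boundary in the direction the advantage favors.
    """
    ratio = np.asarray(ratio, dtype=float)
    advantages = np.asarray(advantages, dtype=float)

    too_high = (advantages > 0) & (ratio > 1.0 + clip_eps)
    too_low = (advantages < 0) & (ratio < 1.0 - clip_eps)
    return too_high | too_low
```

Samples flagged `True` are dropped from the update entirely: the clip provides no gradient pulling the ratio back toward 1, it only stops pushing it further.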
Where it's heading
PPO is unlikely to be displaced as a reference algorithm — its simplicity and interpretability make it the baseline every new method must beat. But the direction of active research is clearly toward critic-free, more memory-efficient methods that preserve PPO's stability guarantees while removing its overhead. The convergence of GRPO, DRPO, and LamPO on PPO-style clipped updates with better advantage estimators suggests the field is iterating on PPO's skeleton rather than abandoning it.




