Almanac
Concept guide · In-depth

PPO: The Workhorse RL Algorithm Behind Modern LLM Alignment


TL;DR

Proximal Policy Optimization began as OpenAI's default reinforcement learning algorithm, chosen for its rare combination of strong performance and practical simplicity, and quietly became the backbone of the RLHF pipelines that align today's large language models. Nearly a decade on, it remains the reference point against which every newer LLM post-training method defines itself, even as a wave of lighter-weight and more stable alternatives chips away at its dominance.

Key takeaways

  • Introduced by OpenAI in July 2017, PPO became their default RL algorithm by matching or exceeding prior state-of-the-art while being simpler to implement and tune.
  • PPO underpins the canonical RLHF pipeline — supervised fine-tuning → reward model → PPO optimization — as documented in practical guides like StackLLaMA on LLaMA.
  • GRPO, RLOO, DPO, and DRPO all position themselves explicitly as alternatives or fixes to PPO's known weaknesses (critic overhead, trust-region hard masking, importance-ratio mismatch in long-tailed vocabularies).
  • Hugging Face's TRL v1.0 ships PPO, DPO, and GRPO side-by-side, making PPO one of several first-class options in the dominant open-source post-training library.
  • A co-located vLLM integration in TRL targets the idle-GPU inefficiency that plagues online PPO pipelines, where generation and gradient-update steps previously required alternating dedicated GPU allocations.
  • Recent variants like DRPO and LamPO retain PPO's clipped-update structure while replacing its hard trust-region mask or scalar advantage with smoother, more stable alternatives.

What it is

Proximal Policy Optimization (PPO) is a policy-gradient reinforcement learning algorithm introduced by OpenAI in July 2017. Its core idea is a clipped surrogate objective: rather than allowing an unconstrained policy update — which can destabilize training by moving too far from the current policy — PPO clips the importance ratio between the new and old policy to keep updates within a trust region. This achieves the stability benefits of earlier trust-region methods (like TRPO) without their second-order optimization overhead, making PPO both performant and practical.

OpenAI adopted it as their default RL algorithm on release, citing its balance of ease of implementation, ease of tuning, and strong empirical performance across a wide range of tasks.
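For reference, the clipped surrogate objective can be written as below. The notation follows the original PPO paper rather than anything stated in this guide: π_θ is the new policy, π_θ_old the policy that generated the data, Â_t an advantage estimate, and ε the clip range (a hyperparameter, commonly around 0.2).

```latex
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)},
\qquad
L^{\text{CLIP}}(\theta) = \hat{\mathbb{E}}_t\!\left[
  \min\!\Big( r_t(\theta)\,\hat{A}_t,\;
  \operatorname{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_t \Big)
\right]
```

The objective is maximized. Taking the minimum of the two terms makes it a pessimistic bound: the policy gains nothing from pushing the ratio further once it has left the interval [1-ε, 1+ε] in the direction the advantage favors.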

How it works

The training loop has three phases that repeat:

  1. Rollout: the current policy generates trajectories (sequences of actions and observations).
  2. Advantage estimation: a separate value network (the critic) estimates how much better or worse each action was than expected, producing an advantage signal.
  3. Policy update: the policy (actor) is updated to increase the probability of high-advantage actions, but only within the clipped trust region. The clip prevents the importance ratio r(θ) = π_new(a|s) / π_old(a|s) from straying too far from 1, discarding gradient signal outside the boundary.
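The policy-update step can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation: the function name `ppo_clip_loss`, the default `clip_eps=0.2`, and the toy inputs are assumptions made for the example, and a real trainer would compute the same quantity on tensors with automatic differentiation.

```python
import math

def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """Negative clipped surrogate objective, averaged over samples."""
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_new - lp_old)  # r = pi_new(a|s) / pi_old(a|s)
        clipped_ratio = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        # Pessimistic bound: keep the smaller of the unclipped and clipped terms.
        total += min(ratio * adv, clipped_ratio * adv)
    return -total / len(advantages)

# Toy batch (hypothetical numbers): ratios are 1.5, 1.0 and 0.5.
logp_old = [math.log(0.5)] * 3
logp_new = [math.log(0.75), math.log(0.5), math.log(0.25)]
advantages = [1.0, 1.0, -1.0]
loss = ppo_clip_loss(logp_new, logp_old, advantages)
```

In the toy batch, the first and third samples are clipped (to 1.2 and 0.8 respectively), so they contribute a constant to the loss and no gradient.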

The critic is trained in parallel to minimize the value prediction error. This actor-critic structure is PPO's most significant engineering cost: when the critic is as large as the actor, it roughly doubles the model footprint, and it adds a second optimization target.
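The guide does not say which advantage estimator is used. A common choice, and the one used in the original PPO paper, is Generalized Advantage Estimation (GAE). The sketch below assumes a single finished episode; the function names, the `gamma` and `lam` defaults, and the toy rewards are illustrative assumptions.

```python
def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation for one finished episode.

    `values[t]` is the critic's estimate V(s_t); the value after the
    final step is taken to be 0 because the episode has terminated.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]  # TD residual
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages

def critic_loss(values, returns):
    """Mean squared error between critic predictions and return targets."""
    return sum((v - r) ** 2 for v, r in zip(values, returns)) / len(values)

# Toy episode (hypothetical numbers): reward arrives only at the last step.
rewards = [0.0, 0.0, 1.0]
values = [0.5, 0.5, 0.5]
adv = gae_advantages(rewards, values, gamma=1.0, lam=1.0)
returns = [a + v for a, v in zip(adv, values)]  # regression targets for the critic
value_loss = critic_loss(values, returns)
```

The advantages feed the policy update, while `returns` are what the critic is regressed toward; this is the second optimization target mentioned above.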

Why it matters

PPO became the load-bearing algorithm of the RLHF (Reinforcement Learning from Human Feedback) era. The canonical alignment pipeline — supervised fine-tuning → reward model training → PPO optimization against the reward model — was documented and popularized through practical guides like Hugging Face's StackLLaMA tutorial, which walked through the full workflow on Meta's LLaMA using the TRL library. PPO's role here is to update the language model's policy to maximize reward-model scores while a KL penalty prevents it from drifting too far from the supervised baseline.
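One common way to combine the reward-model score with the KL penalty is to charge a per-token penalty proportional to the log-probability gap between the policy and the supervised reference, and to add the reward-model score on the final token. The sketch below assumes that formulation; the function name, `beta=0.1`, and the example log-probabilities are hypothetical, and specific libraries may differ in detail.

```python
def kl_shaped_rewards(rm_score, logp_policy, logp_ref, beta=0.1):
    """Per-token rewards: KL penalty on every token, reward-model score on the last."""
    # log pi(token) - log pi_ref(token) is a per-token estimate of the KL term.
    rewards = [-beta * (lp - lr) for lp, lr in zip(logp_policy, logp_ref)]
    rewards[-1] += rm_score
    return rewards

# Hypothetical three-token response scored 2.0 by the reward model.
rewards = kl_shaped_rewards(
    rm_score=2.0,
    logp_policy=[-1.0, -0.5, -2.0],
    logp_ref=[-1.0, -1.5, -1.0],
    beta=0.1,
)
```

Tokens the policy makes more likely than the reference are penalized, and tokens it makes less likely receive a small bonus, so the total reward falls as the policy drifts from the supervised baseline.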

Beyond language models, PPO demonstrated its range in game-playing: an imitation-seeded curriculum using PPO achieved a score of 74,500 on Montezuma's Revenge — a notoriously hard-exploration Atari game — from a single human demonstration, surpassing all prior published results.

Variants and alternatives

The post-training landscape has fragmented significantly since PPO's RLHF debut, with every major alternative defining itself in relation to PPO's known costs and failure modes:

  • GRPO eliminates the critic entirely by computing group-relative advantages across a batch of sampled responses. It is now a co-equal method in TRL v1.0 alongside PPO and DPO, and has become the default for many LLM reasoning fine-tunes.
  • DPO sidesteps the RL loop altogether, framing preference learning as a supervised classification problem over preference pairs. It requires no separately trained reward model and no online rollouts, at the cost of learning only from a fixed, offline preference dataset.
  • RLOO (REINFORCE Leave-One-Out) revisits pure REINFORCE with a leave-one-out baseline, offering a simpler RL alternative to PPO without a critic.
  • DRPO targets a specific PPO pathology: the hard clip discards gradient signal at trust-region boundaries rather than correcting it. DRPO replaces the hard mask with a smooth advantage-weighted quadratic regularizer on policy shift, improving stability and efficiency in LLM post-training across model scales and precision settings.
  • LamPO retains PPO's clipped-update structure but replaces GRPO's scalar group-relative advantages with a Pairwise Decomposed Advantage — aggregating pairwise reward gaps within response groups — showing consistent improvements over GRPO on math and reasoning benchmarks with more stable training dynamics.
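The two critic-free baselines in the list above, GRPO's group-relative advantage and RLOO's leave-one-out baseline, can be sketched as follows for a single prompt with several sampled responses. The function names and toy rewards are assumptions; the use of the population standard deviation in the GRPO sketch is also a choice made here, and implementations may use the sample form instead.

```python
import statistics

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages: standardize rewards within one prompt's group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mean) / (std + eps) for r in rewards]

def rloo_advantages(rewards):
    """Leave-one-out: each sample's baseline is the mean reward of the other samples."""
    k = len(rewards)
    total = sum(rewards)
    return [r - (total - r) / (k - 1) for r in rewards]

# Hypothetical rewards for four responses sampled from the same prompt.
group = [1.0, 0.0, 0.0, 1.0]
grpo_adv = grpo_advantages(group)
rloo_adv = rloo_advantages(group)
```

Both functions need at least two responses per prompt, which is the price of dropping the critic: the baseline comes from extra rollouts rather than from a learned value network.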

Tooling and infrastructure

Hugging Face's TRL library is the dominant open-source implementation surface for PPO in the LLM context. TRL v1.0, released in March 2026, stabilized the API with PPO, DPO, and GRPO as first-class methods. A notable infrastructure improvement landed earlier: co-located vLLM inference within TRL places the generation and training processes on the same GPUs simultaneously, eliminating the idle-GPU problem inherent to online PPO pipelines where generation and gradient-update phases previously required alternating dedicated allocations.
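The idle-GPU problem is easy to quantify with back-of-the-envelope arithmetic. The sketch below is not tied to any TRL API; the function name, GPU counts, and phase durations are hypothetical, and it ignores communication and switching overhead.

```python
def split_pool_utilization(gen_gpus, train_gpus, gen_time, train_time):
    """Fraction of GPU-time doing useful work when two dedicated pools alternate."""
    # Generation GPUs are busy only during generation, training GPUs only during training.
    busy = gen_gpus * gen_time + train_gpus * train_time
    available = (gen_gpus + train_gpus) * (gen_time + train_time)
    return busy / available

# Hypothetical setup: 4 generation GPUs, 4 training GPUs, 60 s generation, 40 s training.
util = split_pool_utilization(gen_gpus=4, train_gpus=4, gen_time=60.0, train_time=40.0)
```

With these made-up numbers, half of the available GPU-time is spent idle. In the idealized co-located case every GPU takes part in both phases, so that idle time is what the integration aims to recover; actual gains depend on memory headroom and overheads this sketch leaves out.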

Tradeoffs and when to use it

Use PPO when: you have online access to a reward signal, sufficient GPU memory for the actor-critic pair, and want a well-understood, production-tested baseline with a large body of implementation guidance.

Consider alternatives when: memory is constrained (GRPO, RLOO), your preference data is offline (DPO), you need more stable trust-region behavior at scale (DRPO), or you want richer advantage signals for reasoning tasks (LamPO).

PPO's hard trust-region masking is a known weakness in LLM settings specifically: importance ratios are a poor proxy for distributional shift in long-tailed vocabularies, and the hard clip discards rather than corrects gradient signal at boundaries — the problem DRPO was designed to address.

Where it's heading

PPO is unlikely to be displaced as a reference algorithm — its simplicity and interpretability make it the baseline every new method must beat. But the direction of active research is clearly toward critic-free, more memory-efficient methods that preserve PPO's stability guarantees while removing its overhead. The convergence of GRPO, DRPO, and LamPO on PPO-style clipped updates with better advantage estimators suggests the field is iterating on PPO's skeleton rather than abandoning it.

[Figure: PPO actor-critic training loop]

PPO and its LLM post-training alternatives

| Method | Critic required? | Trust-region mechanism | Key advantage over PPO | Status in TRL v1.0 |
|---|---|---|---|---|
| PPO | Yes (value network) | Hard clip on importance ratio | Reference baseline; well-understood | Yes |
| GRPO | No | Group-relative advantage normalization | Critic-free; lower memory | Yes |
| DPO | No | Implicit via preference loss | No RL loop; offline | Yes |
| RLOO | No | Leave-one-out baseline | Simpler RL; no critic | Yes |
| DRPO | No | Smooth quadratic regularizer on policy shift | Fixes hard-mask gradient discard | — |
| LamPO | No | Clipped update (PPO-style) | Pairwise decomposed advantage; more stable | — |

Synthesized from the events bundle; '—' indicates not confirmed in the provided events.

Timeline

  1. PPO introduced by OpenAI; adopted as their default RL algorithm

  2. PPO scores 74,500 on Montezuma's Revenge from a single demonstration

  3. StackLLaMA tutorial documents the full PPO-based RLHF pipeline on LLaMA

  4. Hugging Face introduces RLOO as a practical alternative to PPO-based RLHF

  5. TRL co-located vLLM integration targets idle-GPU inefficiency in PPO/GRPO pipelines

  6. TRL v1.0 ships PPO, DPO, and GRPO as co-equal post-training methods

  7. DRPO proposes smooth divergence regularization to fix PPO's hard trust-region masking


FAQ

Why does PPO need a critic (value network) when methods like GRPO don't?

PPO estimates a baseline value for each state to reduce variance in the policy gradient — this requires a separate value network trained in parallel. GRPO and RLOO replace this with group-relative or leave-one-out baselines computed from the sampled responses themselves, eliminating the critic at the cost of needing multiple rollouts per prompt.

What is the 'hard masking' problem that DRPO tries to fix?

PPO clips the importance ratio between the new and old policy to stay within a trust region. Once the ratio has moved past the clip boundary in the direction the advantage favors, the clipped term takes over and the gradient for that sample is zeroed rather than corrected. DRPO replaces this hard mask with a smooth quadratic regularizer, recovering gradient signal at trust-region boundaries and improving stability in LLM post-training.
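The masking behavior can be made concrete by differentiating the standard clipped objective with respect to the ratio. The helper below is a hypothetical illustration (its name and `clip_eps=0.2` are assumptions); it returns the per-sample gradient of min(r·A, clip(r)·A) with respect to r.

```python
def clip_gradient_wrt_ratio(ratio, advantage, clip_eps=0.2):
    """Derivative of min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A) w.r.t. ratio."""
    clipped_ratio = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    if ratio * advantage <= clipped_ratio * advantage:
        return advantage  # unclipped term is active: gradient flows
    return 0.0            # clipped term is active: constant in ratio, no gradient

masked_up = clip_gradient_wrt_ratio(1.5, 1.0)     # ratio too high, positive advantage
masked_down = clip_gradient_wrt_ratio(0.5, -1.0)  # ratio too low, negative advantage
kept = clip_gradient_wrt_ratio(0.5, 1.0)          # ratio too low, but advantage pushes it back
```

The first two cases are the hard mask: the sample is dropped from the update. The third shows that the mask is one-sided, since a sample outside the range still contributes gradient when the advantage pulls the ratio back toward 1.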

Is PPO still the right default for RLHF in 2026?

It depends on your constraints. PPO remains well-understood and production-tested, but critic-free methods like GRPO and RLOO are now first-class options in TRL v1.0 and are preferred when memory or simplicity is a priority; DPO sidesteps the RL loop entirely for offline preference data.
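For the offline case, DPO's objective can be sketched as an ordinary supervised loss over preference pairs, with no rollouts and no explicit reward model. The sketch follows the published DPO formulation for a single pair; the function name, `beta=0.1`, and the example log-probabilities are illustrative assumptions.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair, given summed sequence log-probabilities."""
    chosen_margin = logp_chosen - ref_logp_chosen        # how far the policy moved on the preferred response
    rejected_margin = logp_rejected - ref_logp_rejected  # how far it moved on the rejected response
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)), written as log(1 + exp(-logits)).
    return math.log1p(math.exp(-logits))

# Hypothetical log-probabilities under the policy and a frozen reference model.
at_reference = dpo_loss(-10.0, -12.0, -10.0, -12.0)
after_update = dpo_loss(-9.0, -13.0, -10.0, -12.0)
```

When the policy equals the reference the loss is log 2; it decreases as the policy raises the preferred response's likelihood relative to the rejected one, measured against the reference.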

What was the idle-GPU problem in PPO pipelines and how was it addressed?

Online PPO alternates between a generation phase (using the model for rollouts) and a training phase (gradient updates), leaving GPUs idle during whichever phase isn't running. TRL's co-located vLLM integration places inference and training on the same GPUs simultaneously, eliminating that idle time.


More on PPO (6)

OpenAI Blog · 1mo ago

OpenAI Releases Proximal Policy Optimization (PPO)

OpenAI introduced Proximal Policy Optimization (PPO), a new class of reinforcement learning algorithms that match or exceed state-of-the-art performance while being simpler to implement and tune. PPO was adopted as OpenAI's default RL algorithm due to its balance of ease of use and strong performance. The release marked a significant methodological contribution to the RL field that would go on to underpin many subsequent AI training pipelines.

OpenAI Blog · 1mo ago

Learning Montezuma's Revenge from a Single Demonstration

OpenAI trained a reinforcement learning agent to achieve a score of 74,500 on Montezuma's Revenge using a single human demonstration, surpassing all previously published results. The method is straightforward: the agent plays episodes starting from carefully selected states drawn from the demonstration, optimizing game score via PPO. This approach demonstrates that imitation-seeded curriculum learning can dramatically improve exploration in hard-exploration environments. The same PPO algorithm underpins OpenAI Five.

Hugging Face Blog · 1mo ago

StackLLaMA: A hands-on guide to train LLaMA with RLHF

Hugging Face published a detailed tutorial demonstrating how to fine-tune Meta's LLaMA model using Reinforcement Learning from Human Feedback (RLHF) on StackExchange data. The guide covers the full pipeline: supervised fine-tuning, reward model training, and PPO-based RL optimization. It serves as a practical reference for practitioners seeking to replicate RLHF workflows on open-weight models using the TRL library.

Berkeley AI Research (BAIR) Blog · 1mo ago

RL without TD Learning: Divide-and-Conquer Value Learning for Long-Horizon Off-Policy RL

A BAIR blog post introduces a divide-and-conquer paradigm for off-policy reinforcement learning that avoids temporal difference (TD) learning's error accumulation problem by reducing Bellman recursions logarithmically rather than linearly. The approach leverages the triangle inequality structure of goal-conditioned RL to define a transitive Bellman update rule, enabling value learning that scales to long-horizon tasks. The authors claim this is the first practical realization of divide-and-conquer value learning at scale in goal-conditioned RL settings, building on an idea traceable to Kaelbling (1993). The post frames this as a third paradigm alongside TD and Monte Carlo methods, addressing a key gap in scalable off-policy RL.

Hugging Face Blog · 1mo ago

TRL v1.0: Post-Training Library Built to Move with the Field

Hugging Face has released TRL v1.0, a major milestone for its post-training library focused on reinforcement learning from human feedback and related alignment techniques. The release signals a stabilization of the API and feature set after iterative development tracking the rapidly evolving post-training landscape. TRL is widely used in the open-source community for fine-tuning and aligning language models using methods such as PPO, DPO, and GRPO.

Hugging Face Blog · 1mo ago

No GPU left behind: Unlocking Efficiency with Co-located vLLM in TRL

Hugging Face's TRL library now supports co-locating vLLM inference alongside training on the same GPUs, eliminating the idle GPU problem that arises when separate inference and training processes alternate. This approach allows reinforcement learning from human feedback (RLHF) and online RL training pipelines to use GPUs continuously rather than leaving them idle during generation or gradient update phases. The integration targets efficiency gains in online RL training workflows such as GRPO and PPO, where generation and training steps previously required dedicated, alternating GPU allocations.