Proximal Policy Optimization (PPO): The Workhorse of Modern RL Training

TL;DR: PPO began as a simpler, more stable alternative to prior policy-gradient methods and quickly became the default reinforcement learning algorithm at OpenAI and across the field. Its combination of clipped surrogate objectives, ease of tuning, and broad applicability carried it from game-playing agents to the RLHF pipelines that align today's large language models, and it continues to be chosen over newer alternatives when long-horizon stability matters most.

Key takeaways

  • Introduced by OpenAI in July 2017 as a drop-in improvement over TRPO, PPO became OpenAI's default RL algorithm immediately on release.
  • OpenAI Five — the system that defeated world-champion Dota 2 players — was trained at scale using PPO, demonstrating its viability for long-horizon, high-dimensional multi-agent tasks.
  • PPO is the optimizer inside the standard RLHF pipeline: pretrain LM → collect preferences → train reward model → fine-tune with PPO.
  • Practical RLHF with PPO requires careful engineering of reward normalization, KL penalty scheduling, value function initialization, and batch construction — details rarely captured in papers.
  • Z.ai's GLM-5.2 (June 2026) switched from GRPO back to PPO for long-horizon RL training, citing stability advantages for agentic coding tasks.
  • ZPPO (June 2026) addresses a known failure mode that PPO shares with other RL training methods, zero-reward rollouts producing no gradient signal, by embedding teacher guidance in prompts rather than in policy gradients or logit imitation.

What it is

Proximal Policy Optimization (PPO) is a policy-gradient reinforcement learning algorithm introduced by OpenAI in July 2017. Its core idea is simple: when updating a policy from collected experience, clip the ratio between the new and old action probabilities so that no single update moves the policy too far. This "proximal" constraint — enforced cheaply through a clipped surrogate loss rather than an expensive second-order trust-region calculation — lets practitioners run multiple gradient epochs over the same batch of experience without destabilizing training. OpenAI adopted PPO as its default RL algorithm on release, citing its balance of performance and ease of use.

How it works

The standard PPO objective clips the probability ratio $r_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_\text{old}}(a_t \mid s_t)$ to the interval $[1-\epsilon, 1+\epsilon]$, then takes the minimum of the clipped and unclipped ratios, each weighted by the advantage estimate, to form a pessimistic bound on policy improvement. A separate value network (critic) is trained in parallel to estimate returns, providing a variance-reducing baseline for the policy gradient. In practice, PPO alternates between collecting rollouts under the current policy and running several epochs of minibatch gradient descent on the combined actor-critic loss, a pattern that makes it straightforward to parallelize across many environment workers.
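
The clipped objective described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the function name `ppo_clip_loss`, the default `clip_eps=0.2`, and the toy batch are assumptions chosen for the example, and the advantages are taken as given.

```python
import numpy as np

def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """Negative clipped surrogate objective, averaged over samples.

    Hypothetical helper for illustration; inputs are per-sample
    log-probabilities of the taken actions and advantage estimates.
    """
    ratio = np.exp(logp_new - logp_old)  # r_t(theta)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Pessimistic bound: elementwise minimum, negated to give a loss to minimize.
    return -np.mean(np.minimum(unclipped, clipped))

# Toy batch with made-up numbers: the probability ratios are 1.5, 0.5, and 1.0.
logp_old = np.log(np.array([0.2, 0.4, 0.5]))
logp_new = np.log(np.array([0.3, 0.2, 0.5]))
advantages = np.array([1.0, -1.0, 2.0])
loss = ppo_clip_loss(logp_new, logp_old, advantages)
```

In the toy batch, the first sample's ratio of 1.5 is capped at 1.2 because its advantage is positive, while the second sample (ratio 0.5, negative advantage) is charged at the clipped ratio 0.8, which is the more pessimistic of the two terms.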

`` Collect rollouts → Compute advantages → Clip ratio → Update actor + critic → Repeat ``
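
The "Compute advantages" step is not spelled out above. One estimator commonly paired with PPO is generalized advantage estimation (GAE); the sketch below assumes that choice, a single trajectory segment that does not terminate mid-way, and illustrative defaults for `gamma` and `lam`.

```python
import numpy as np

def gae_advantages(rewards, values, last_value, gamma=0.99, lam=0.95):
    """GAE for one trajectory segment (assumed estimator, not from the guide).

    values[t] is the critic's estimate at step t; last_value is the critic's
    estimate for the state after the final step (0.0 if the episode ended).
    """
    T = len(rewards)
    adv = np.zeros(T)
    next_value, running = last_value, 0.0
    for t in reversed(range(T)):
        # One-step TD error using the critic as a baseline.
        delta = rewards[t] + gamma * next_value - values[t]
        # Exponentially weighted sum of future TD errors.
        running = delta + gamma * lam * running
        adv[t] = running
        next_value = values[t]
    return adv
```

Setting `lam=0` reduces this to the one-step TD error, and `gamma=lam=1` gives the full return minus the critic's baseline, which shows the bias-variance dial the critic provides.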

The KL divergence between old and new policy is monitored (and sometimes penalized directly) to catch cases where the clipping bound is insufficient.
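
A minimal sketch of that monitoring step follows. It estimates the KL divergence from the same log-probabilities used for the ratio, using a low-variance sample estimator that is always non-negative. The threshold `target_kl` and the 1.5 multiplier are hypothetical values for illustration, not settings taken from any source cited here.

```python
import numpy as np

def approx_kl(logp_new, logp_old):
    """Sample estimate of KL(old || new) from actions drawn under the old policy."""
    log_ratio = logp_new - logp_old
    # (r - 1) - log r is non-negative for every sample and zero when r == 1.
    return float(np.mean(np.exp(log_ratio) - 1.0 - log_ratio))

def should_stop_early(logp_new, logp_old, target_kl=0.015):
    """Hypothetical rule: stop the current epochs if KL overshoots the target."""
    return approx_kl(logp_new, logp_old) > 1.5 * target_kl
```

A training loop could call `should_stop_early` after each minibatch epoch and break out to collect fresh rollouts when it returns `True`.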

Why it matters

PPO's significance extends well beyond its original game-playing context. Two trajectories define its impact:

Game-playing at scale. OpenAI Five — the system that defeated amateur Dota 2 teams in 2018 and world champions in 2019 — was trained using large-scale distributed PPO with self-play. Dota 2 is a long-horizon, high-dimensional, multi-agent environment; the fact that PPO scaled to superhuman performance there established it as viable for the hardest RL problems of its era.

RLHF for language models. The standard pipeline for aligning large language models with human preferences (pretrain, collect preference data, train a reward model, fine-tune with RL) uses PPO as its optimizer. PPO's stability under a KL penalty (which keeps the fine-tuned model close to the base) and its value network (which reduces gradient variance over long token sequences) make it well-suited to the LLM fine-tuning regime. A December 2022 Hugging Face overview laid out this pipeline as the canonical RLHF recipe; a 2023 practitioner reference from the same source cataloged the low-level engineering details that papers routinely omit but that determine whether training actually converges: reward normalization, KL penalty scheduling, value function initialization, and batch construction.
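
To make the KL penalty concrete, the sketch below shows one common way to fold it into the reward signal: each generated token is penalized by how far the policy's log-probability drifts from the reference (base) model, and the reward model's scalar score is added at the final token. The function name, the placement of the score, and `beta=0.1` are assumptions for illustration.

```python
import numpy as np

def kl_shaped_rewards(rm_score, logp_policy, logp_ref, beta=0.1):
    """Per-token rewards for one sampled response (illustrative formulation).

    rm_score: scalar from the reward model for the whole response.
    logp_policy / logp_ref: per-token log-probs of the sampled tokens under
    the policy being trained and the frozen reference model.
    """
    logp_policy = np.asarray(logp_policy, dtype=float)
    logp_ref = np.asarray(logp_ref, dtype=float)
    # Penalize tokens the policy now prefers more strongly than the reference.
    rewards = -beta * (logp_policy - logp_ref)
    # The sequence-level score arrives only at the last token.
    rewards[-1] += rm_score
    return rewards

shaped = kl_shaped_rewards(1.0, [-1.0, -2.0, -0.5], [-1.5, -2.0, -1.0])
```

Because the score lands on the final token only, the earlier tokens receive credit through the advantage estimates, which is where the value network's baseline earns its cost.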

Variants and alternatives

The post-training landscape has diversified considerably since 2017, but PPO remains the reference point:

  • GRPO (Group Relative Policy Optimization) eliminates the value network by normalizing rewards within a group of rollouts for the same prompt. This reduces memory and compute but removes the baseline that helps PPO handle sparse or delayed rewards. Z.ai's GLM-5.2 (June 2026) switched from GRPO back to PPO for long-horizon agentic coding RL, explicitly citing stability advantages — a concrete data point on where GRPO's tradeoffs bite.
  • DPO (Direct Preference Optimization) bypasses the RL loop entirely, framing preference alignment as a supervised objective. It is simpler to implement but requires a different data format and cannot easily incorporate process-level or outcome-based reward signals.
  • ZPPO (Zone of Proximal Policy Optimization, June 2026) addresses a failure mode that PPO shares with other RL training methods: when a small model's rollouts all receive zero reward, no gradient signal flows. ZPPO embeds teacher guidance directly in prompts (Binary and Negative Candidate-included Questions) and uses a replay buffer for hard questions, outperforming both distillation and GRPO baselines on the Qwen3 family (0.8B–9B) across a 31-benchmark suite, with the largest gains at the smallest model scale.
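
The group-relative normalization in the GRPO bullet can be sketched as follows. This is a simplified illustration with hypothetical names and made-up rewards; it also shows how a group in which every rollout scores zero yields all-zero advantages, the kind of zero-signal case that ZPPO is described as targeting.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of rollouts for the same prompt."""
    rewards = np.asarray(rewards, dtype=float)
    # No critic: the group's own mean and spread act as the baseline.
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Two of four rollouts succeed: advantages separate winners from losers.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# Every rollout fails: all advantages are zero, so the update carries no signal.
all_zero = group_relative_advantages([0.0, 0.0, 0.0, 0.0])
```

The same collapse happens when every rollout in a group succeeds, since only differences within the group produce a non-zero advantage.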

PPO in robotics and agentic systems

Beyond LLMs, PPO remains the optimizer of choice in continuous-control settings. AgenticRL (June 2026) uses PPO inside a closed-loop framework where a multimodal GPT agent generates reward functions, trains policies, and iteratively refines them for UAV navigation — achieving a 71% improvement in policy behavior over initial rewards and 91% real-world success. CoorDex (June 2026), a pipeline for dexterous humanoid loco-manipulation, trains coordinated residual RL policies using PPO-style updates to compose latent priors for body and hand motion on a Unitree G1 robot.

Implementation tradeoffs and pitfalls

PPO's practical complexity is higher than its clean objective suggests. The Hugging Face practitioner reference identifies several critical knobs for RLHF stability:

  • Reward normalization: unnormalized rewards cause value function divergence.
  • KL penalty scheduling: too tight and the model doesn't learn; too loose and it reward-hacks.
  • Value function initialization: initializing from the LM's final layer rather than randomly is often essential.
  • Batch construction: mixing on-policy and off-policy data, or using stale rollouts, degrades performance in ways that are hard to diagnose.
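
Two of the knobs above, reward normalization and KL penalty scheduling, can be sketched as small stateful helpers. Both classes are hypothetical illustrations: the running-statistics normalizer uses Welford's update, and the KL controller uses a simple proportional rule with made-up defaults (`target_kl=6.0`, `horizon=10_000`, error clipped to ±0.2). Neither is presented as the scheme used in the Hugging Face reference.

```python
import numpy as np

class RunningRewardNormalizer:
    """Tracks a running mean and variance of rewards and rescales batches."""

    def __init__(self, eps=1e-8):
        self.count, self.mean, self.m2, self.eps = 0, 0.0, 0.0, eps

    def update(self, rewards):
        # Welford's online update, one reward at a time.
        for r in np.asarray(rewards, dtype=float).ravel():
            self.count += 1
            delta = r - self.mean
            self.mean += delta / self.count
            self.m2 += delta * (r - self.mean)

    def normalize(self, rewards):
        std = np.sqrt(self.m2 / max(self.count, 1))
        return (np.asarray(rewards, dtype=float) - self.mean) / (std + self.eps)

class AdaptiveKLCoef:
    """Proportional controller that nudges the KL coefficient toward a target."""

    def __init__(self, init_beta=0.1, target_kl=6.0, horizon=10_000):
        self.beta, self.target_kl, self.horizon = init_beta, target_kl, horizon

    def update(self, observed_kl, n_steps):
        # Clipped relative error keeps any single adjustment small.
        error = float(np.clip(observed_kl / self.target_kl - 1.0, -0.2, 0.2))
        self.beta *= 1.0 + error * n_steps / self.horizon
        return self.beta
```

When the observed KL runs above the target the coefficient grows, tightening the leash on the policy; when it runs below, the coefficient shrinks so the model has room to learn.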

Z.ai's GLM-5.2 also deployed a reward-hacking mitigation pipeline — rule-based filters plus a judge model — alongside PPO, underscoring that the algorithm alone does not prevent reward exploitation in long-horizon settings.

Where it's heading

PPO's longevity is a function of its stability properties, which matter most precisely where newer, lighter alternatives struggle: long-horizon tasks, sparse rewards, and settings where a value baseline is worth its memory cost. The 2026 evidence — GLM-5.2 reverting from GRPO, ZPPO patching its zero-reward failure mode, robotics pipelines continuing to rely on it — suggests PPO is not being displaced so much as specialized around. Expect it to remain the default for high-stakes, long-horizon RL while lighter methods (GRPO, DPO) handle the dense-reward, short-horizon alignment cases where their simplicity wins.

[Figure: PPO training loop and its role in RLHF]

[Figure: PPO across application domains]

PPO vs. key alternatives in the post-training landscape

| Method | Update mechanism | Critic required | Stability | Best for |
| --- | --- | --- | --- | --- |
| PPO | Clipped surrogate objective; multiple epochs per batch | Yes (value network) | High (bounded policy shift) | Long-horizon tasks, RLHF, robotics |
| TRPO | KL-constrained trust region; second-order update | Yes | High but expensive | When exact constraint is needed |
| GRPO | Group-relative reward normalization; no value network | No | Moderate | LLM reasoning tasks with dense rewards |
| DPO | Direct preference optimization; no RL loop | No | High (supervised-style) | Preference alignment without a reward model |
| REINFORCE | Vanilla policy gradient; high variance | No | Low | Simple / short-horizon tasks |

Timeline

  1. July 2017: PPO introduced by OpenAI; adopted as default RL algorithm

  2. 2018: OpenAI Five defeats amateur Dota 2 teams using large-scale PPO

  3. 2019: OpenAI Five defeats world champions; landmark for long-horizon RL at scale

  4. December 2022: RLHF overview published; PPO established as the LLM alignment optimizer

  5. 2023: Hugging Face catalogs PPO implementation details critical for RLHF stability

  6. June 2026: Z.ai GLM-5.2 switches from GRPO to PPO for long-horizon agentic RL

FAQ

Why did PPO displace TRPO as the go-to policy-gradient method?

PPO achieves stability comparable in practice to TRPO's trust-region constraint, but it replaces the expensive second-order update with a simple clipped objective that can be optimized with first-order methods. That makes it far cheaper to implement and tune while matching or exceeding TRPO's performance.
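
The difference is easiest to see with the two optimization problems side by side. The notation is an assumption for this comparison: $\hat{A}_t$ is an advantage estimate, $\delta$ is TRPO's trust-region size, and $\epsilon$ is PPO's clip range.

$$
\text{TRPO:}\quad \max_\theta \; \mathbb{E}_t\!\left[ r_t(\theta)\,\hat{A}_t \right]
\quad \text{subject to} \quad
\mathbb{E}_t\!\left[ \mathrm{KL}\!\left( \pi_{\theta_\text{old}}(\cdot \mid s_t) \,\Vert\, \pi_\theta(\cdot \mid s_t) \right) \right] \le \delta
$$

$$
\text{PPO:}\quad \max_\theta \; \mathbb{E}_t\!\left[ \min\!\left( r_t(\theta)\,\hat{A}_t,\; \operatorname{clip}\!\left( r_t(\theta),\, 1-\epsilon,\, 1+\epsilon \right) \hat{A}_t \right) \right]
$$

TRPO enforces closeness through an explicit constraint that must be handled by a constrained solver; PPO folds the limit into the objective itself, so an ordinary gradient optimizer suffices.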

What makes PPO the standard optimizer for RLHF?

RLHF requires stable, incremental policy updates against a reward model while a KL penalty keeps the fine-tuned model close to the base — exactly the regime PPO was designed for. Its value network also provides a baseline that reduces gradient variance over long token sequences.

Why would a lab switch back to PPO from GRPO?

Z.ai's GLM-5.2 switched from GRPO to PPO for long-horizon agentic coding tasks, suggesting that GRPO's lack of a value network becomes a liability when reward signals are sparse or delayed across many steps.

What are the most common failure modes when running PPO for LLM fine-tuning?

Reward hacking, training instability from poorly initialized value functions, and KL penalty miscalibration are the most documented; a Hugging Face reference post catalogs reward normalization, KL scheduling, value initialization, and batch construction as the critical engineering knobs.

Is PPO used in robotics as well as LLMs?

Yes. Recent examples include PPO-based pipelines for UAV navigation (AgenticRL) and dexterous humanoid manipulation (CoorDex, which uses PPO-style updates), both of which rely on PPO's stability for continuous-control tasks.

More on Proximal Policy Optimization

OpenAI Blog

OpenAI Releases Proximal Policy Optimization (PPO)

OpenAI introduced Proximal Policy Optimization (PPO), a new class of reinforcement learning algorithms that match or exceed state-of-the-art performance while being simpler to implement and tune. PPO was adopted as OpenAI's default RL algorithm due to its balance of ease of use and strong performance. The release marked a significant methodological contribution to the RL field that would go on to underpin many subsequent AI training pipelines.

arXiv · cs.CL

ZPPO: Teacher-in-prompt training method outperforms distillation and GRPO for small vision-language models

Researchers introduce Zone of Proximal Policy Optimization (ZPPO), a training method inspired by Vygotsky's zone of proximal development that embeds teacher guidance in prompts rather than policy gradients or logit imitation. On hard questions where student rollouts fail, ZPPO constructs Binary Candidate-included Questions (BCQ) and Negative Candidate-included Questions (NCQ) to help the student discriminate correct from incorrect responses, with a replay buffer that recirculates hard questions until mastered. Evaluated on the Qwen3 family (0.8B–9B) with a 27B teacher across a 31-benchmark suite covering VLM, LLM, and video tasks, ZPPO outperforms both distillation and GRPO baselines, with the largest gains at the smallest model scale. The method addresses a known failure mode of RL training where zero-reward rollouts produce no gradient signal.

Hugging Face Blog

The N Implementation Details of RLHF with PPO

This Hugging Face blog post catalogs the numerous low-level implementation details that matter when applying Reinforcement Learning from Human Feedback (RLHF) using Proximal Policy Optimization (PPO) for language model fine-tuning. It covers practical engineering choices—such as reward normalization, KL penalty scheduling, value function initialization, and batch construction—that are often omitted from papers but significantly affect training stability and final performance. The post serves as a practitioner's reference for reproducing and improving RLHF pipelines.

Hugging Face Blog

Illustrating Reinforcement Learning from Human Feedback (RLHF)

This Hugging Face blog post provides an illustrated overview of Reinforcement Learning from Human Feedback (RLHF), explaining the technique used to align large language models with human preferences. It covers the core pipeline: pretraining a language model, collecting human preference data, training a reward model, and fine-tuning with RL. Published in December 2022, it served as an accessible reference during the period when RLHF was becoming central to frontier model development.

OpenAI Blog

Dota 2 with Large Scale Deep Reinforcement Learning

OpenAI published a detailed account of the OpenAI Five system that defeated world-champion Dota 2 players using large-scale deep reinforcement learning. The work describes the training infrastructure, self-play curriculum, and scaling properties that enabled superhuman performance in a complex multi-agent environment. This represents a landmark result in applying RL at scale to long-horizon, high-dimensional tasks.

arXiv · cs.AI

AgenticRL: Self-refining LLM-guided reward design and policy refinement for UAV navigation

AgenticRL is a framework that uses a multimodal GPT agent to automate reward function generation, policy training via PPO, and closed-loop self-refinement for UAV navigation tasks. The agent evaluates trained policies through diagnostic feedback, identifies failure modes, and iteratively refines rewards without human intervention. Evaluated across five navigation tasks, the closed-loop refinement improves policy behavior by 71% over initial rewards, with sim-to-real transfer achieving 91% real-world success rate and 94% sim-to-real accuracy.