Almanac
Concept guide · Beginner

Proximal Policy Optimization (PPO): The Algorithm That Trains AI to Learn from Feedback

TL;DR: PPO is a reinforcement learning algorithm that teaches AI systems to improve through trial and error, and it became the backbone of how modern AI assistants are trained to be helpful and safe. Introduced by OpenAI in 2017 as a simpler, more stable alternative to earlier methods, it has since powered everything from world-champion game-playing bots to the human-feedback pipelines behind today's large language models.

Key takeaways

  • OpenAI introduced PPO in July 2017 and immediately adopted it as their default reinforcement learning algorithm.
  • PPO powered OpenAI Five, which defeated world-champion Dota 2 players — a landmark test of RL at scale in a complex, long-horizon environment.
  • PPO is the engine inside RLHF (Reinforcement Learning from Human Feedback), the technique used to make language models like ChatGPT follow instructions and behave helpfully.
  • In 2026, Z.ai's GLM-5.2 switched from GRPO back to PPO for long-horizon RL training, showing PPO remains competitive against newer alternatives.
  • A widely cited Hugging Face guide documents dozens of low-level engineering choices, such as reward normalization and KL penalty scheduling, that make or break a PPO-based training run.

What PPO is — and why you should care

Proximal Policy Optimization, or PPO, is a reinforcement learning (RL) algorithm — a recipe for teaching an AI to get better at a task by trying things, seeing what works, and adjusting its behavior accordingly. Think of it like training a dog: reward the good behaviors, discourage the bad ones, and repeat until the dog (or AI) figures out what you want.

What made PPO special when OpenAI introduced it in July 2017 was that it solved a frustrating problem with earlier RL methods: they were either too slow and cautious, or they updated the AI's behavior so aggressively that training would collapse. PPO found a sweet spot — it clips (limits) how much the AI's behavior can change in any single update, keeping learning stable without sacrificing speed. OpenAI liked it so much they adopted it as their default RL algorithm immediately.

How it works (without the math)

Imagine you're coaching a chess player. After each game, you give feedback. A bad coach might say "change everything about how you play" — overwhelming and counterproductive. A good coach says "adjust a few specific things, see how it goes, then adjust again." PPO is the good coach: it makes measured, bounded updates so the AI improves steadily rather than lurching around.

The "proximal" in the name literally means "nearby" — each new version of the AI's behavior must stay close to the previous version. This constraint is what makes training reliable enough to run at massive scale.
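For readers who want to see the clipping idea in code, here is a minimal sketch of the clipped objective as it is commonly written. The function name `clipped_surrogate`, the `clip_eps=0.2` default, and the toy probabilities are assumptions chosen for illustration, not values taken from any specific training setup.

```python
import numpy as np

def clipped_surrogate(new_logp, old_logp, advantages, clip_eps=0.2):
    """Per-sample PPO clipped surrogate terms (higher is better).

    Illustrative sketch: the name and defaults are assumptions.
    """
    new_logp = np.asarray(new_logp, dtype=float)
    old_logp = np.asarray(old_logp, dtype=float)
    advantages = np.asarray(advantages, dtype=float)

    # How much more (or less) likely the new policy makes each action.
    ratio = np.exp(new_logp - old_logp)

    unclipped = ratio * advantages
    # Keep the ratio inside [1 - clip_eps, 1 + clip_eps].
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages

    # Taking the minimum removes any extra credit for moving too far.
    return np.minimum(unclipped, clipped)

# Toy data: the old policy gave each action probability 0.5,
# and both actions turned out better than expected (advantage 1.0).
old_logp = np.log([0.5, 0.5])
advantages = [1.0, 1.0]

# A modest change (ratio 1.1) is credited in full.
modest = clipped_surrogate(np.log([0.55, 0.5]), old_logp, advantages)
# A drastic change (ratio 1.8) is capped at 1 + clip_eps = 1.2.
drastic = clipped_surrogate(np.log([0.90, 0.5]), old_logp, advantages)
print(modest, drastic)
```

Because the drastic update earns no more than a ratio of 1.2 would, the optimizer has no incentive to push the policy further in a single step, which is the "stay nearby" behavior described above.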

The big moments: games, then language

PPO's first headline moment was OpenAI Five, a team of five neural networks that learned to play the complex strategy game Dota 2 entirely through self-play. By June 2018 it was beating amateur human teams; in April 2019 it defeated OG, the reigning world champions, and OpenAI published a detailed account of the system in December 2019. This was a landmark: PPO had proven it could handle tasks with long chains of decisions, partial information, and real-time coordination, far harder than the simple environments RL had previously mastered.

Then came an even bigger application: RLHF (Reinforcement Learning from Human Feedback). This is the technique that turned raw language models into helpful assistants. The pipeline works like this: human raters score AI responses, a "reward model" learns to predict those scores, and then PPO trains the language model to produce responses the reward model rates highly. The result is an AI that follows instructions, avoids harmful outputs, and generally behaves the way its developers intend. PPO is the engine that makes the final step work.
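To make the final step of that pipeline concrete, here is a deliberately tiny sketch. It assumes a "language model" that can only choose among three canned responses, a reward model reduced to a fixed list of made-up scores, and exact expectations in place of sampled rollouts. The names `ppo_round` and `scores` are hypothetical, and real RLHF systems add a neural network policy, a value model, and a KL penalty that this toy omits.

```python
import numpy as np

def softmax(logits):
    z = np.exp(logits - np.max(logits))
    return z / z.sum()

def ppo_round(logits, scores, clip_eps=0.2, lr=1.0, epochs=4):
    """One round of clipped policy improvement on a toy 3-response policy."""
    old_probs = softmax(logits)
    # Advantage: how much better each response is than the current average.
    advantages = scores - np.dot(old_probs, scores)
    new_logits = logits.copy()
    for _ in range(epochs):
        probs = softmax(new_logits)
        ratio = probs / old_probs
        # The gradient is zero wherever the clip is binding.
        active = np.where(advantages >= 0,
                          ratio < 1.0 + clip_eps,
                          ratio > 1.0 - clip_eps)
        weights = old_probs * active * advantages * ratio
        # Uses d(ratio_k)/d(logit_j) = ratio_k * (1[k == j] - probs_j).
        grad = weights - probs * weights.sum()
        new_logits = new_logits + lr * grad
    return new_logits

# Hypothetical reward-model scores for three canned responses.
scores = np.array([0.1, 0.5, 0.9])
logits = np.zeros(3)  # start with no preference among responses

for _ in range(20):
    logits = ppo_round(logits, scores)

final_probs = softmax(logits)
print(final_probs)
```

Each round shifts probability toward the responses the reward model scores highly, but the clip stops any single round from moving the policy very far from where it started.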

The hidden complexity

Running PPO well in practice turns out to be tricky. A widely read Hugging Face guide, "The N Implementation Details of RLHF with PPO," documented dozens of engineering details that papers often leave out: how to normalize rewards, how to schedule the penalty that keeps the AI from drifting too far from its original behavior, and how to initialize the value function. These details make the difference between a training run that works and one that doesn't. This gap between "PPO in theory" and "PPO that actually trains a good model" is why experienced practitioners treat it as a craft, not just a formula.
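Two of those details, reward normalization and the drift penalty, can be sketched in a few lines. This is an illustration under assumptions rather than the Hugging Face guide's actual code: the function names, the per-response KL estimate, and the simple proportional schedule in `update_beta` are all hypothetical choices.

```python
import numpy as np

def normalize_rewards(rewards, eps=1e-8):
    """Rescale a batch of rewards to zero mean and unit variance."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

def kl_penalized_reward(rm_score, logp_policy, logp_ref, beta):
    """Reward-model score minus a penalty for drifting from the reference.

    logp_policy and logp_ref are per-token log-probabilities of the same
    response under the trained policy and the frozen original model.
    """
    drift = float(np.sum(np.asarray(logp_policy) - np.asarray(logp_ref)))
    return rm_score - beta * drift, drift

def update_beta(beta, observed_kl, target_kl, step_size=0.1):
    """Raise the penalty weight when drift exceeds the target, lower it otherwise."""
    error = np.clip(observed_kl / target_kl - 1.0, -0.2, 0.2)
    return float(beta * (1.0 + step_size * error))

# Toy numbers: a two-token response that the policy now likes more
# than the reference model did.
reward, drift = kl_penalized_reward(
    rm_score=1.0,
    logp_policy=[-1.0, -1.0],
    logp_ref=[-1.5, -1.5],
    beta=0.1,
)
beta_next = update_beta(beta=0.1, observed_kl=drift, target_kl=0.5)
print(reward, drift, beta_next)
```

The design choice worth noticing is that the penalty weight is not fixed: when the measured drift runs above the target, the next update makes drifting more expensive.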

PPO vs. newer alternatives

PPO isn't the only option anymore. GRPO (Group Relative Policy Optimization) emerged as a simpler alternative that skips the separate value model PPO requires, making it cheaper to run for certain tasks. ZPPO is a newer method that embeds teacher guidance directly in prompts, helping small models learn when they'd otherwise get no useful feedback signal at all.

Yet PPO keeps proving its worth. In 2026, Z.ai's GLM-5.2 — a 753-billion-parameter open-weights model — switched back to PPO from GRPO specifically for long-horizon agentic training, finding it more reliable for complex, extended tasks. PPO has also been used to train UAV navigation policies that self-refine through closed-loop feedback, and it underpins robotics research like coordinated dexterous manipulation in humanoid robots.
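The practical difference between PPO and GRPO comes down to where the "better than expected" signal comes from. The sketch below contrasts the two under simplifying assumptions: one reward per response, and hypothetical function names. It also shows the failure case that methods like ZPPO target, where every attempt in a group earns zero reward.

```python
import numpy as np

def value_baseline_advantages(rewards, value_estimates):
    """PPO-style: compare each reward to a learned value model's prediction."""
    rewards = np.asarray(rewards, dtype=float)
    value_estimates = np.asarray(value_estimates, dtype=float)
    return rewards - value_estimates

def group_relative_advantages(group_rewards, eps=1e-8):
    """GRPO-style: compare each reward to the others sampled for the same prompt."""
    r = np.asarray(group_rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Four attempts at one prompt; two succeed (reward 1) and two fail (reward 0).
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# Four attempts that all fail: every advantage is zero, so there is
# nothing for the policy to learn from in this group.
all_failed = group_relative_advantages([0.0, 0.0, 0.0, 0.0])

# The PPO-style version needs predictions from a separate value model
# (made-up numbers here).
with_critic = value_baseline_advantages([1.0, 0.0], [0.6, 0.6])
print(mixed, all_failed, with_critic)
```

Skipping the value model saves memory and compute, but it means the quality of the signal depends entirely on how varied the rewards within each group are.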

Why it endures

PPO's staying power comes from a combination of simplicity, stability, and generality. It doesn't require exotic infrastructure, it works across wildly different domains (games, language, robots, drones), and its behavior is well-understood enough that practitioners know how to debug it when things go wrong. In a field where new methods appear constantly, that combination of reliability and breadth is hard to beat.

[Figure: How PPO fits into the RLHF pipeline]

PPO vs. alternatives for AI training

| Method | How it works | Best for | Notable use |
|--------|--------------|----------|-------------|
| PPO | Clips policy updates to stay close to the previous version; stable and general | Long-horizon tasks, RLHF, complex environments | OpenAI Five, RLHF pipelines, GLM-5.2 |
| GRPO | Uses group-relative rewards instead of a separate value model; simpler setup | Reasoning tasks, smaller-scale RL | Various LLM reasoning fine-tunes |
| ZPPO | Embeds teacher guidance in prompts; helps when student gets zero reward | Small vision-language models | Qwen3 family fine-tuning |

Timeline

  1. OpenAI introduces PPO; adopts it as default RL algorithm

  2. OpenAI Five defeats amateur Dota 2 teams using large-scale PPO

  3. OpenAI Five defeats world champions; landmark for long-horizon RL

  4. RLHF with PPO becomes central to frontier LLM alignment

  5. Hugging Face catalogs the engineering details that make PPO-based RLHF work in practice

  6. GLM-5.2 switches from GRPO to PPO for long-horizon agentic training

FAQ

What does PPO actually do?

It trains an AI agent to take better actions by rewarding good outcomes and penalizing bad ones — but it updates the agent's behavior in careful, small steps so training stays stable and doesn't spiral out of control.

Why does PPO matter for AI assistants like ChatGPT?

PPO is the engine inside RLHF — the process where human raters score AI responses and the model is trained to produce more of what humans prefer. Without a stable algorithm like PPO, that feedback loop is hard to run reliably.

Is PPO still used, or have newer methods replaced it?

It remains widely used. As recently as 2026, Z.ai's GLM-5.2 switched back to PPO from a newer alternative (GRPO) specifically for long-horizon training tasks, citing its reliability.

Do I need to understand the math to benefit from knowing about PPO?

Not at all — the key intuition is 'learn from feedback, but don't change too fast.' The math enforces that intuition; the concept is what matters for understanding why modern AI behaves the way it does.

More on Proximal Policy Optimization (6)

OpenAI Blog

OpenAI Releases Proximal Policy Optimization (PPO)

OpenAI introduced Proximal Policy Optimization (PPO), a new class of reinforcement learning algorithms that match or exceed state-of-the-art performance while being simpler to implement and tune. PPO was adopted as OpenAI's default RL algorithm due to its balance of ease of use and strong performance. The release marked a significant methodological contribution to the RL field that would go on to underpin many subsequent AI training pipelines.

arXiv · cs.CL

ZPPO: Teacher-in-prompt training method outperforms distillation and GRPO for small vision-language models

Researchers introduce Zone of Proximal Policy Optimization (ZPPO), a training method inspired by Vygotsky's zone of proximal development that embeds teacher guidance in prompts rather than policy gradients or logit imitation. On hard questions where student rollouts fail, ZPPO constructs Binary Candidate-included Questions (BCQ) and Negative Candidate-included Questions (NCQ) to help the student discriminate correct from incorrect responses, with a replay buffer that recirculates hard questions until mastered. Evaluated on the Qwen3 family (0.8B–9B) with a 27B teacher across a 31-benchmark suite covering VLM, LLM, and video tasks, ZPPO outperforms both distillation and GRPO baselines, with the largest gains at the smallest model scale. The method addresses a known failure mode of RL training where zero-reward rollouts produce no gradient signal.

Hugging Face Blog

The N Implementation Details of RLHF with PPO

This Hugging Face blog post catalogs the numerous low-level implementation details that matter when applying Reinforcement Learning from Human Feedback (RLHF) using Proximal Policy Optimization (PPO) for language model fine-tuning. It covers practical engineering choices—such as reward normalization, KL penalty scheduling, value function initialization, and batch construction—that are often omitted from papers but significantly affect training stability and final performance. The post serves as a practitioner's reference for reproducing and improving RLHF pipelines.

Hugging Face Blog

Illustrating Reinforcement Learning from Human Feedback (RLHF)

This Hugging Face blog post provides an illustrated overview of Reinforcement Learning from Human Feedback (RLHF), explaining the technique used to align large language models with human preferences. It covers the core pipeline: pretraining a language model, collecting human preference data, training a reward model, and fine-tuning with RL. Published in December 2022, it served as an accessible reference during the period when RLHF was becoming central to frontier model development.

OpenAI Blog

Dota 2 with Large Scale Deep Reinforcement Learning

OpenAI published a detailed account of the OpenAI Five system that defeated world-champion Dota 2 players using large-scale deep reinforcement learning. The work describes the training infrastructure, self-play curriculum, and scaling properties that enabled superhuman performance in a complex multi-agent environment. This represents a landmark result in applying RL at scale to long-horizon, high-dimensional tasks.

arXiv · cs.AI

AgenticRL: Self-refining LLM-guided reward design and policy refinement for UAV navigation

AgenticRL is a framework that uses a multimodal GPT agent to automate reward function generation, policy training via PPO, and closed-loop self-refinement for UAV navigation tasks. The agent evaluates trained policies through diagnostic feedback, identifies failure modes, and iteratively refines rewards without human intervention. Evaluated across five navigation tasks, the closed-loop refinement improves policy behavior by 71% over initial rewards, with sim-to-real transfer achieving 91% real-world success rate and 94% sim-to-real accuracy.