Entity · technique

Proximal Policy Optimization

technique · active · proximal-policy-optimization-4fe84caa · 12 events · first seen May 19, 2026

Aliases: Proximal Policy Optimization, Zone of Proximal Policy Optimization

More like this (12)

APPO: Agentic Procedural Policy Optimization · Pareto Optimal Policy Optimization · Preference Coordinated Multi-agent Policy Optimization · Role-Aware Policy Optimization · Divergence Regularized Policy Optimization · Direct Preference Optimization (DPO) · Vector Policy Optimization · GRPO (Group Relative Policy Optimization) · Denoising Diffusion Policy Optimization · Hierarchical Relative Policy Optimization · Hybrid Median-length Policy Optimization · Bayesian Optimization

Guides (1)

Proximal Policy Optimization · Concept

Proximal Policy Optimization (PPO): The Algorithm That Trains AI to Learn from Feedback

Recent events (12)

6 · arXiv · cs.AI · 3d ago

Pictura: GPU-accelerated simulator enables first large-scale driving self-play from egocentric camera images

Researchers introduce Pictura, a GPU-accelerated multi-agent driving simulator that renders each agent's egocentric camera view at every training step, enabling self-play without privileged observations like exact poses or velocities. Using Pictura, they train Alberti via PPO over 50B agent steps (~35M km of driving), the first large-scale driving policy trained directly from perspective images. Alberti approaches the performance of privileged vectorized counterparts and zero-shot transfers to Waymo Open Motion Dataset layouts, outperforming privileged agents there. The work addresses a longstanding representation gap between simulation-trained policies and real deployed agents.

Training Infrastructure · Agent and Tool Ecosystem · Alberti · Proximal Policy Optimization · Waymo Open Motion Dataset · +2 more

5 · arXiv · cs.CL · Jul 15, 2026

CARE-PPO: PPO-based RL framework for joint quantitative prediction and confidence estimation in LLMs

Researchers introduce CARE-PPO, a reinforcement learning fine-tuning framework that jointly trains LLMs for numerical prediction accuracy and calibrated confidence estimation. The approach repurposes the PPO critic as a confidence estimator at inference time, using a Confidence-Aligned Reward for Estimation derived from prediction error. Evaluated on healthcare and finance tasks with Qwen3 4B and 8B models, CARE-PPO outperforms logit-based and verbalized confidence baselines and shows improved out-of-distribution generalization. The work addresses the hallucination and overconfidence problems that limit LLM deployment in high-stakes quantitative domains.

Evaluation and Benchmarking · Alignment and RLHF · Proximal Policy Optimization · Qwen3 · CARE-PPO

5 · arXiv · cs.LG · Jul 9, 2026

Selective timestep weighting and advantage-based replay improve diffusion RLHF sample efficiency by up to 6×

A new arXiv preprint proposes two complementary techniques to improve feedback efficiency in diffusion model RLHF: a per-timestep weighting scheme grounded in PPO convergence theory, and a replay mechanism that prioritizes informative trajectories to reduce redundant reward queries. Together, the methods achieve up to 6× improvement in sample efficiency over standard diffusion RLHF baselines under identical hyperparameter settings. The work addresses a practical bottleneck—feedback cost—that limits real-world deployment of RLHF-aligned diffusion models.

Alignment and RLHF · Multimodal Progress · Selective Timestep Weighting and Advantage-Based Replay for Sample-Efficient Diffusion RLHF · Proximal Policy Optimization

7 · The Batch · Jun 26, 2026

Z.ai releases GLM-5.2, a 753B MoE open-weights model claiming top open-model ranking on agentic coding benchmarks

Z.ai released GLM-5.2, a 753-billion-parameter mixture-of-experts open-weights model optimized for long-running agentic coding tasks, with a 1-million-token input context and MIT license. The model ranks first among open-weights models on Artificial Analysis's Intelligence Index v4.1 (score 51, behind Claude Opus 4.8 at 56 and GPT-5.5 at 55) and leads all models on PostTrainBench, a benchmark for agentic fine-tuning tasks. Key technical contributions include a modified sparse attention indexer applied every four layers (cutting per-token computation 2.9x at 1M context), a switch from GRPO to PPO for long-horizon RL training, and a reward-hacking mitigation pipeline using rule-based filters and a judge model. API pricing is substantially below comparable proprietary models, and the release coincides with U.S. government restrictions on access to Anthropic's frontier models.

Open Weights Progress · Inference Economics · Artificial Analysis Intelligence Index · AA-Briefcase · DeepSeek V4 · +14 more

6 · arXiv · cs.LG · Jun 23, 2026

CoorDex: Learning pipeline for continuous dexterous humanoid loco-manipulation with high-DoF hands

CoorDex is a reinforcement learning pipeline that enables humanoid robots to perform dexterous manipulation while walking, eliminating the stop-and-go pattern common in prior work. The approach trains separate privileged motion tracking teachers for body and hand, distills them into latent priors, and uses coordinated residual RL to compose them for downstream tasks. Demonstrated on a Unitree G1 humanoid with a 20-DoF WUJI hand, the system achieves non-stop bottle grasping, fridge door opening, and cube manipulation in motion. Ablations show that naive joint-space or monolithic approaches fail under the same reward budget, validating the latent-prior architecture.

Agent and Tool Ecosystem · WUJI hand · Proximal Policy Optimization · CoorDex · +1 more

6 · arXiv · cs.CL · Jun 17, 2026

ZPPO: Teacher-in-prompt training method outperforms distillation and GRPO for small vision-language models

Researchers introduce Zone of Proximal Policy Optimization (ZPPO), a training method inspired by Vygotsky's zone of proximal development that embeds teacher guidance in prompts rather than policy gradients or logit imitation. On hard questions where student rollouts fail, ZPPO constructs Binary Candidate-included Questions (BCQ) and Negative Candidate-included Questions (NCQ) to help the student discriminate correct from incorrect responses, with a replay buffer that recirculates hard questions until mastered. Evaluated on the Qwen3 family (0.8B–9B) with a 27B teacher across a 31-benchmark suite covering VLM, LLM, and video tasks, ZPPO outperforms both distillation and GRPO baselines, with the largest gains at the smallest model scale. The method addresses a known failure mode of RL training where zero-reward rollouts produce no gradient signal.

Open Weights Progress · Alignment and RLHF · GRPO · Proximal Policy Optimization · Qwen3 · +1 more
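
The failure mode mentioned in the ZPPO summary, zero-reward rollouts producing no gradient signal, can be illustrated with a group-relative advantage computation of the kind used in GRPO-style training. This is a minimal sketch and is not taken from the ZPPO paper: the function name, the use of the population standard deviation, and the `eps` value are assumptions made for illustration.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward against its own group.

    Hypothetical helper: real implementations differ in how they
    compute the spread and in the value of eps.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption
    return [(r - mean) / (std + eps) for r in rewards]


# Every rollout failed: all advantages are zero, so a policy-gradient
# update weighted by these advantages receives no learning signal.
all_failed = group_relative_advantages([0.0, 0.0, 0.0, 0.0])

# One rollout succeeded: it gets a positive advantage, the rest negative.
one_success = group_relative_advantages([1.0, 0.0, 0.0, 0.0])
```

When every reward in the group is identical, each term `r - mean` is zero, which is why hard questions with no successful rollout contribute nothing to the update under this scheme.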

6 · arXiv · cs.AI · Jun 3, 2026

AgenticRL: Self-refining LLM-guided reward design and policy refinement for UAV navigation

AgenticRL is a framework that uses a multimodal GPT agent to automate reward function generation, policy training via PPO, and closed-loop self-refinement for UAV navigation tasks. The agent evaluates trained policies through diagnostic feedback, identifies failure modes, and iteratively refines rewards without human intervention. Evaluated across five navigation tasks, the closed-loop refinement improves policy behavior by 71% over initial rewards, with sim-to-real transfer achieving 91% real-world success rate and 94% sim-to-real accuracy.

Agent and Tool Ecosystem · Self-Refining Agentic Reinforcement Learning for Vision-Conditioned UAV Navigation · AgenticRL · Proximal Policy Optimization

8 · OpenAI Blog · May 20, 2026

OpenAI Releases Proximal Policy Optimization (PPO)

OpenAI introduced Proximal Policy Optimization (PPO), a new class of reinforcement learning algorithms that match or exceed state-of-the-art performance while being simpler to implement and tune. PPO was adopted as OpenAI's default RL algorithm due to its balance of ease of use and strong performance. The release marked a significant methodological contribution to the RL field that would go on to underpin many subsequent AI training pipelines.

AI Safety Research · Alignment and RLHF · PPO · Proximal Policy Optimization · OpenAI
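
Part of why PPO is described as simple to implement is that its core update reduces to a clipped surrogate objective over per-sample probability ratios. The sketch below shows that objective in plain Python under stated assumptions: the function name is hypothetical, inputs are per-sample log-probabilities and advantage estimates, and `clip_eps=0.2` is a commonly used default rather than a value given in the summary above. Value-function and entropy terms, which full implementations usually add, are omitted.

```python
import math


def ppo_clip_objective(logp_new, logp_old, advantages, clip_eps=0.2):
    """Mean clipped surrogate objective (a quantity to maximize).

    Hypothetical helper for illustration, not a library API.
    """
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        # Probability ratio between the updated policy and the old one.
        ratio = math.exp(lp_new - lp_old)
        # Keep the ratio inside [1 - clip_eps, 1 + clip_eps].
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        # Take the more pessimistic of the two surrogate terms.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)


# Ratio of 2 with a positive advantage: the gain is capped at 1 + clip_eps.
capped_gain = ppo_clip_objective([math.log(2.0)], [0.0], [1.0])

# Ratio of 0.5 with a negative advantage: the clipped term is the lower one.
capped_loss = ppo_clip_objective([math.log(0.5)], [0.0], [-1.0])
```

Taking the minimum of the clipped and unclipped terms removes the incentive to move the policy far from the one that collected the data, which is the "proximal" part of the name.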

6 · OpenAI Blog · May 20, 2026

OpenAI Five Defeats Amateur Human Teams at Dota 2

OpenAI announced that OpenAI Five, a team of five neural networks trained via self-play, has begun defeating amateur human teams at Dota 2. This represented an early milestone in applying reinforcement learning to complex, long-horizon multi-agent environments. The system was trained using large-scale distributed RL, demonstrating that neural networks could coordinate in real-time strategy games without hand-crafted rules.

Evaluation and Benchmarking · Agent and Tool Ecosystem · OpenAI Five · Dota 2 · Proximal Policy Optimization · +1 more

6 · OpenAI Blog · May 20, 2026

Dota 2 with Large Scale Deep Reinforcement Learning

OpenAI published a detailed account of the OpenAI Five system that defeated world-champion Dota 2 players using large-scale deep reinforcement learning. The work describes the training infrastructure, self-play curriculum, and scaling properties that enabled superhuman performance in a complex multi-agent environment. This represents a landmark result in applying RL at scale to long-horizon, high-dimensional tasks.

Training Infrastructure · AI Safety Research · OpenAI Five · Dota 2 · Proximal Policy Optimization · +1 more

5 · Hugging Face Blog · May 19, 2026

Illustrating Reinforcement Learning from Human Feedback (RLHF)

This Hugging Face blog post provides an illustrated overview of Reinforcement Learning from Human Feedback (RLHF), explaining the technique used to align large language models with human preferences. It covers the core pipeline: pretraining a language model, collecting human preference data, training a reward model, and fine-tuning with RL. Published in December 2022, it served as an accessible reference during the period when RLHF was becoming central to frontier model development.

Frontier Model Releases · Alignment and RLHF · Reinforcement Learning from Human Feedback · Proximal Policy Optimization · Hugging Face · +1 more

6 · Hugging Face Blog · May 19, 2026

The N Implementation Details of RLHF with PPO

This Hugging Face blog post catalogs the numerous low-level implementation details that matter when applying Reinforcement Learning from Human Feedback (RLHF) using Proximal Policy Optimization (PPO) for language model fine-tuning. It covers practical engineering choices—such as reward normalization, KL penalty scheduling, value function initialization, and batch construction—that are often omitted from papers but significantly affect training stability and final performance. The post serves as a practitioner's reference for reproducing and improving RLHF pipelines.

Agent and Tool Ecosystem · Alignment and RLHF · KL Divergence · Reinforcement Learning from Human Feedback · Proximal Policy Optimization · +2 more
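
One of the engineering choices named in the summary above, the KL penalty, is often applied by shaping per-token rewards before PPO sees them. The following is a rough sketch of that common pattern and not the post's own code: the function name is hypothetical, `beta` is assumed fixed (no scheduling is shown), and the reward-model score is assumed to be added only at the final token.

```python
def kl_shaped_rewards(logp_policy, logp_ref, score, beta=0.1):
    """Per-token rewards with a KL-style penalty against a reference model.

    Hypothetical helper: logp_policy and logp_ref are log-probabilities
    of the sampled tokens under the trained policy and the frozen
    reference model; score is a scalar from a reward model.
    """
    # Penalize tokens the policy makes more likely than the reference does.
    rewards = [-beta * (lp - lr) for lp, lr in zip(logp_policy, logp_ref)]
    # The sequence-level score is credited at the last token (an assumption).
    rewards[-1] += score
    return rewards


shaped = kl_shaped_rewards(
    logp_policy=[-1.0, -2.0],
    logp_ref=[-1.0, -1.5],
    score=1.0,
    beta=0.1,
)
```

A larger `beta` keeps the fine-tuned model closer to the reference at the cost of slower reward improvement; scheduled or adaptive coefficients adjust this trade-off during training.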