Almanac
technique

PPO

techniqueactiveppo-9a605f11·11 events·first seen 1mo ago

Aliases: PPO

Co-occurring entities

More like this (12)

Guides (1)

Recent events (11)

8Openai Blog·1mo ago·source ↗

OpenAI Releases Proximal Policy Optimization (PPO)

OpenAI introduced Proximal Policy Optimization (PPO), a new class of reinforcement learning algorithms that match or exceed state-of-the-art performance while being simpler to implement and tune. PPO was adopted as OpenAI's default RL algorithm due to its balance of ease of use and strong performance. The release marked a significant methodological contribution to the RL field that would go on to underpin many subsequent AI training pipelines.

5Openai Blog·1mo ago·source ↗

Learning Montezuma's Revenge from a Single Demonstration

OpenAI trained a reinforcement learning agent to achieve a score of 74,500 on Montezuma's Revenge using a single human demonstration, surpassing all previously published results. The method is straightforward: the agent plays episodes starting from carefully selected states drawn from the demonstration, optimizing game score via PPO. This approach demonstrates that imitation-seeded curriculum learning can dramatically improve exploration in hard-exploration environments. The same PPO algorithm underpins OpenAI Five.

5Hugging Face Blog·1mo ago·source ↗

StackLLaMA: A hands-on guide to train LLaMA with RLHF

Hugging Face published a detailed tutorial demonstrating how to fine-tune Meta's LLaMA model using Reinforcement Learning from Human Feedback (RLHF) on StackExchange data. The guide covers the full pipeline: supervised fine-tuning, reward model training, and PPO-based RL optimization. It serves as a practical reference for practitioners seeking to replicate RLHF workflows on open-weight models using the TRL library.

6Berkeley Ai Research (Bair) Blog·1mo ago·source ↗

RL without TD Learning: Divide-and-Conquer Value Learning for Long-Horizon Off-Policy RL

A BAIR blog post introduces a divide-and-conquer paradigm for off-policy reinforcement learning that avoids temporal difference (TD) learning's error accumulation problem by reducing Bellman recursions logarithmically rather than linearly. The approach leverages the triangle inequality structure of goal-conditioned RL to define a transitive Bellman update rule, enabling value learning that scales to long-horizon tasks. The authors claim this is the first practical realization of divide-and-conquer value learning at scale in goal-conditioned RL settings, building on an idea traceable to Kaelbling (1993). The post frames this as a third paradigm alongside TD and Monte Carlo methods, addressing a key gap in scalable off-policy RL.

6Hugging Face Blog·1mo ago·source ↗

TRL v1.0: Post-Training Library Built to Move with the Field

Hugging Face has released TRL v1.0, a major milestone for its post-training library focused on reinforcement learning from human feedback and related alignment techniques. The release signals a stabilization of the API and feature set after iterative development tracking the rapidly evolving post-training landscape. TRL is widely used in the open-source community for fine-tuning and aligning language models using methods such as PPO, DPO, and GRPO.

5Hugging Face Blog·1mo ago·source ↗

No GPU left behind: Unlocking Efficiency with Co-located vLLM in TRL

Hugging Face's TRL library now supports co-locating vLLM inference alongside training on the same GPUs, eliminating the idle GPU problem that arises when separate inference and training processes alternate. This approach allows reinforcement learning from human feedback (RLHF) and online RL training pipelines to use GPUs continuously rather than leaving them idle during generation or gradient update phases. The integration targets efficiency gains in online RL training workflows such as GRPO and PPO, where generation and training steps previously required dedicated, alternating GPU allocations.

5Hugging Face Blog·1mo ago·source ↗

Putting RL back in RLHF: RLOO Implementation on Hugging Face

Hugging Face published a blog post introducing RLOO (REINFORCE Leave-One-Out), a reinforcement learning algorithm aimed at making the RL component of RLHF more practical and effective. The post discusses implementation details and motivations for revisiting pure RL-based fine-tuning approaches within the TRL library. This represents a technical contribution to the alignment and RLHF tooling ecosystem, offering an alternative to PPO-based RLHF pipelines.

5arXiv · cs.CL·1mo ago·source ↗

LamPO: Lambda-Style Policy Optimization with Pairwise Decomposed Advantage for Reasoning LMs

LamPO proposes a new RLVR training objective that replaces GRPO's scalar group-relative advantages with a Pairwise Decomposed Advantage, aggregating pairwise reward gaps within response groups and weighting comparisons by confidence-aware log-probability differences. The method retains a critic-free, clipped-update PPO-style structure and optionally adds a ROUGE-L-based dense auxiliary reward to reduce sparsity. Experiments on AIME24, AIME25, MATH-500, and GPQA-Diamond using Qwen3-1.7B, Qwen3-4B, and Phi-4-mini show consistent improvements over GRPO and other RLVR variants with more stable training dynamics.

6arXiv · cs.CL·19d ago·source ↗

CoRP: Gradient-Free Consolidation of Rewarded Perturbations for LLM Post-Training

CoRP (Consolidating Rewarded Perturbations) is a gradient-free post-training operator that folds an ensemble of reward-weighted weight-space perturbations into a single deployable model, eliminating the inference-time cost of ensemble methods like RandOpt. A split-half analysis across 25 model-task pairs reveals reproducible low-rank structure in the rewarded perturbation population, which CoRP exploits via reward-weighted aggregation, compatibility-aware reweighting, and a held-out validation gate. Evaluated on five models (0.5B–8B) across math, code, and creative writing, CoRP improves the base model by 8.1 points on average, exceeds single-inference RandOpt by 6.5 points using one-tenth the perturbation budget, and recovers more than half the gain of a 50-pass majority-vote ensemble at one forward pass per test example.

5arXiv · cs.LG·11d ago·source ↗

DRPO: Smooth divergence regularization replaces hard masking in LLM RL training

A new arXiv preprint proposes Divergence Regularized Policy Optimization (DRPO), a method that replaces the hard trust-region mask used in DPPO with a smooth advantage-weighted quadratic regularizer on policy shift. The approach addresses a known weakness in PPO and GRPO where importance ratios poorly proxy distributional shift in long-tailed vocabularies, and in DPPO where gradient signals are discarded rather than corrected at trust-region boundaries. Experiments across model scales, architectures, and precision settings show improved stability and efficiency in LLM RL post-training.

5arXiv · cs.CL·11d ago·source ↗

AdvGRPO: Stable co-training framework for adaptive red teaming of language models

Researchers introduce AdvGRPO, a co-training framework that makes GRPO viable for joint attacker-defender optimization in LLM red teaming, addressing previously reported instability. The method uses dense multi-channel rewards and decoupled advantage normalization, with a curriculum progressing from single-turn to multi-turn attacks before bootstrapping co-training. Co-trained defenders outperform baselines on safety benchmarks, and the attacks show transferability across models.