5 · OpenAI Blog · 1mo ago

Learning Montezuma's Revenge from a Single Demonstration

OpenAI trained a reinforcement learning agent to achieve a score of 74,500 on Montezuma's Revenge using a single human demonstration, surpassing all previously published results. The method is straightforward: the agent plays episodes starting from carefully selected states drawn from the demonstration, optimizing game score via PPO. This approach demonstrates that imitation-seeded curriculum learning can dramatically improve exploration in hard-exploration environments. The same PPO algorithm underpins OpenAI Five.
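The start-state selection described above can be sketched as a backward curriculum: begin episodes near the end of the demonstration and move the start point earlier once the agent succeeds often enough. This is a minimal illustration, not OpenAI's implementation. The names `run_backward_curriculum` and `ToyLearner`, the 20% success threshold, and the one-step-back schedule are all assumptions for the example, and a toy memorizing learner stands in for PPO.

```python
import random


def run_backward_curriculum(demo_states, rollout, threshold=0.2,
                            rollouts_per_stage=50, max_stages=1000):
    """Move the episode start point backward along a demonstration.

    demo_states: states visited by the demonstrator, in order.
    rollout(state) -> bool: hypothetical callback that resets the environment
        to `state`, runs and trains the policy for one episode, and reports
        whether the episode reached the demonstrator's outcome.
    Returns the start index used at each stage.
    """
    start = len(demo_states) - 1      # begin just before the end of the demo
    history = []
    for _ in range(max_stages):
        history.append(start)
        wins = sum(rollout(demo_states[start]) for _ in range(rollouts_per_stage))
        if wins / rollouts_per_stage >= threshold:
            if start == 0:
                break                 # solved from the true initial state
            start -= 1                # succeeding often enough: start earlier
    return history


class ToyLearner:
    """Stand-in for an RL agent on a sparse-reward chain (assumption, not PPO).

    Each state has one correct action; an episode succeeds only if every
    remaining step is correct, so success from state 0 by pure chance is rare.
    """

    def __init__(self, n_states, n_actions=4, seed=0):
        self.rng = random.Random(seed)
        self.n_actions = n_actions
        self.secret = [self.rng.randrange(n_actions) for _ in range(n_states)]
        self.known = {}               # state -> action that led to success

    def rollout(self, state):
        chosen = {}
        for s in range(state, len(self.secret)):
            # Reuse a remembered action, otherwise explore at random.
            a = self.known.get(s, self.rng.randrange(self.n_actions))
            if a != self.secret[s]:
                return False          # sparse reward: any mistake fails
            chosen[s] = a
        self.known.update(chosen)     # "learn" only from successful episodes
        return True


# In this toy, the demonstration states are just the chain indices 0..9.
learner = ToyLearner(n_states=10)
history = run_backward_curriculum(list(range(10)), learner.rollout)
print(history)
```

Because each stage leaves only one unfamiliar step between the start state and known-good behavior, the learner never has to discover the whole sequence by chance, which is the intuition behind starting from demonstration states.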

Evaluation and Benchmarking · Agent and Tool Ecosystem · OpenAI Five · PPO · OpenAI · Montezuma's Revenge

Related guides (4)

OpenAI

OpenAI: The Lab That Made AI a Household Word

PPO · Concept

PPO: The Reinforcement Learning Algorithm That Taught AI to Learn from Feedback

Agent and Tool Ecosystem · Topic guide

Agent and Tool Ecosystem: How the Infrastructure Layer Around LLMs Is Consolidating

Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: The Shifting Yardstick of AI Capability

Related events (8)

6 · OpenAI Blog · 1mo ago

Dota 2 with Large Scale Deep Reinforcement Learning

OpenAI published a detailed account of the OpenAI Five system that defeated world-champion Dota 2 players using large-scale deep reinforcement learning. The work describes the training infrastructure, self-play curriculum, and scaling properties that enabled superhuman performance in a complex multi-agent environment. This represents a landmark result in applying RL at scale to long-horizon, high-dimensional tasks.

Training Infrastructure · AI Safety Research · OpenAI Five · Dota 2 · Proximal Policy Optimization · +1 more

6 · OpenAI Blog · 1mo ago

More on Dota 2: OpenAI Self-Play Reaches Superhuman Performance

OpenAI reports that a self-play reinforcement learning system progressed from below high-ranked human level to beating top professional Dota 2 players within one month, using only 1v1 mid-lane play. The post highlights self-play as a mechanism that automatically improves training data quality as the agent improves, contrasting it with supervised learning's dependence on fixed datasets. The result is presented as evidence that sufficient compute combined with self-play can rapidly close and exceed human-level performance gaps.

Evaluation and Benchmarking · Agent and Tool Ecosystem · self-play · OpenAI Five · Dota 2 · +2 more

7 · OpenAI Blog · 1mo ago

Learning from Human Preferences: OpenAI and DeepMind Collaborate on Reward Learning from Comparisons

OpenAI, in collaboration with DeepMind's safety team, published a method for learning reward functions directly from human preference comparisons between pairs of agent behaviors, eliminating the need to hand-code goal functions. The algorithm infers human intent by asking evaluators which of two proposed behaviors is preferable, addressing risks from misspecified reward functions. This work is an early foundational contribution to what would become reinforcement learning from human feedback (RLHF). It targets both safety and alignment concerns around reward hacking and proxy gaming.
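Learning a reward from pairwise comparisons can be illustrated with a Bradley-Terry style model, where the probability that behavior A is preferred grows with the difference in predicted reward. The sketch below is an assumption-laden toy, not the published system: it uses a linear reward over two hand-picked features (hypothetically "progress" and "damage taken"), four invented comparisons, and plain stochastic gradient descent.

```python
import math


def preference_prob(w, seg_a, seg_b):
    """P(a is preferred to b) under a Bradley-Terry model with reward w . x."""
    ra = sum(wi * xi for wi, xi in zip(w, seg_a))
    rb = sum(wi * xi for wi, xi in zip(w, seg_b))
    return 1.0 / (1.0 + math.exp(rb - ra))


def fit_reward(comparisons, dim, lr=0.5, epochs=200):
    """Fit reward weights from (seg_a, seg_b, label) triples.

    label is 1 if the evaluator preferred seg_a and 0 if they preferred seg_b.
    """
    w = [0.0] * dim
    for _ in range(epochs):
        for a, b, label in comparisons:
            p = preference_prob(w, a, b)
            # Gradient of the cross-entropy loss w.r.t. w is (p - label) * (a - b).
            for i in range(dim):
                w[i] -= lr * (p - label) * (a[i] - b[i])
    return w


# Illustrative data only: features are [progress, damage_taken].
comparisons = [
    ([1.0, 0.0], [0.0, 0.0], 1),   # more progress preferred
    ([0.0, 1.0], [0.0, 0.0], 0),   # taking damage is not preferred
    ([1.0, 1.0], [0.0, 1.0], 1),
    ([0.5, 0.0], [0.5, 1.0], 1),
]
w = fit_reward(comparisons, dim=2)
print(w, preference_prob(w, [1.0, 0.0], [0.0, 1.0]))
```

The fitted weights can then serve as the reward signal for a standard RL algorithm, so no goal function has to be written by hand.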

Evaluation and Benchmarking · AI Safety Research · Reward Learning from Comparisons · DeepMind · Reinforcement Learning from Human Feedback · +2 more

5 · OpenAI Blog · 1mo ago

Evolution Strategies as a Scalable Alternative to Reinforcement Learning

OpenAI published research showing that evolution strategies (ES), a decades-old optimization technique, can match standard reinforcement learning performance on benchmarks such as Atari and MuJoCo. The approach offers practical advantages over RL, including easier parallelization and lower sensitivity to hyperparameters. This positions ES as a viable alternative training paradigm for policy optimization tasks.
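A basic ES loop is short enough to sketch in full: perturb the parameters with Gaussian noise, score each perturbed copy, and move the parameters toward the perturbations that scored well. The version below is a toy under stated assumptions, not the setup used in the research. The function name, population size, noise scale, learning rate, and the quadratic objective are all illustrative choices.

```python
import random
import statistics


def evolution_strategies(f, theta, iterations=300, npop=50,
                         sigma=0.1, alpha=0.001, seed=0):
    """Maximize f by following a noise-based estimate of its gradient."""
    rng = random.Random(seed)
    theta = list(theta)
    dim = len(theta)
    for _ in range(iterations):
        # Sample one Gaussian perturbation per population member.
        noise = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(npop)]
        # Each evaluation is independent, which is what makes ES easy to parallelize.
        returns = [f([t + sigma * e for t, e in zip(theta, eps)]) for eps in noise]
        mean = statistics.fmean(returns)
        std = statistics.pstdev(returns) or 1.0   # guard against zero spread
        advantages = [(r - mean) / std for r in returns]
        # Step along the return-weighted sum of perturbations.
        for i in range(dim):
            grad_i = sum(a * eps[i] for a, eps in zip(advantages, noise))
            theta[i] += alpha / (npop * sigma) * grad_i
    return theta


# Toy objective: negative squared distance to a hidden target vector.
target = [0.5, 0.1, -0.3]


def objective(x):
    return -sum((xi - ti) ** 2 for xi, ti in zip(x, target))


solution = evolution_strategies(objective, [0.0, 0.0, 0.0])
distance = sum((s - t) ** 2 for s, t in zip(solution, target)) ** 0.5
print(solution, distance)
```

Note that the loop needs only the scalar return of each candidate, with no backpropagation through the objective.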

Evaluation and Benchmarking · Alignment and RLHF · Evolution Strategies · MuJoCo · Reinforcement Learning · +2 more

6 · OpenAI Blog · 1mo ago

Reinforcement Learning with Prediction-Based Rewards (Random Network Distillation)

OpenAI introduces Random Network Distillation (RND), a curiosity-driven exploration method for reinforcement learning that uses prediction error on a fixed random neural network as an intrinsic reward signal. RND is the first method to exceed average human performance on Montezuma's Revenge, a notoriously hard-exploration Atari game, without using demonstrations or access to the underlying game state. The approach is simple to implement and compatible with standard RL algorithms, offering a scalable alternative to count-based or dynamics-model exploration bonuses.
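The intrinsic reward can be sketched in a few lines: a fixed, randomly initialized target network maps each state to a feature vector, a predictor is trained to match it on visited states, and the squared prediction error is the exploration bonus. The class below is a simplified stand-in, not the published architecture. The `RND` class name, the single-layer networks, the one-hot states, and the learning rate are assumptions for illustration.

```python
import math
import random


class RND:
    """Toy Random Network Distillation bonus with single-layer networks."""

    def __init__(self, in_dim, out_dim=8, lr=0.1, seed=0):
        rng = random.Random(seed)
        # Target network: random weights that are never trained.
        self.target_w = [[rng.gauss(0.0, 1.0) for _ in range(in_dim)]
                         for _ in range(out_dim)]
        # Predictor network: trained to imitate the target on visited states.
        self.pred_w = [[rng.gauss(0.0, 0.1) for _ in range(in_dim)]
                       for _ in range(out_dim)]
        self.lr = lr

    def _target(self, x):
        return [math.tanh(sum(w * xi for w, xi in zip(row, x)))
                for row in self.target_w]

    def _predict(self, x):
        return [sum(w * xi for w, xi in zip(row, x)) for row in self.pred_w]

    def bonus(self, x):
        """Intrinsic reward: squared error between predictor and target."""
        return sum((p - t) ** 2
                   for p, t in zip(self._predict(x), self._target(x)))

    def update(self, x):
        """One SGD step on half the squared prediction error for state x."""
        errors = [p - t for p, t in zip(self._predict(x), self._target(x))]
        for row, err in zip(self.pred_w, errors):
            for i, xi in enumerate(x):
                row[i] -= self.lr * err * xi


rnd = RND(in_dim=4)
seen_state = [1.0, 0.0, 0.0, 0.0]
novel_state = [0.0, 1.0, 0.0, 0.0]
for _ in range(200):            # the agent keeps revisiting the same state
    rnd.update(seen_state)
bonus_seen = rnd.bonus(seen_state)
bonus_novel = rnd.bonus(novel_state)
print(bonus_seen, bonus_novel)
```

The bonus for the frequently visited state shrinks toward zero while the unvisited state keeps a larger error, which is what steers the agent toward novelty.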

Evaluation and Benchmarking · AI Safety Research · OpenAI · Random Network Distillation · Yuri Burda · +2 more

6 · OpenAI Blog · 1mo ago

OpenAI Five Defeats Amateur Human Teams at Dota 2

OpenAI announced that OpenAI Five, a team of five neural networks trained via self-play, had begun defeating amateur human teams at Dota 2. This was an early milestone in applying reinforcement learning to complex, long-horizon multi-agent environments. The system was trained using large-scale distributed RL, demonstrating that neural networks could coordinate in real-time strategy games without hand-crafted rules.

Evaluation and Benchmarking · Agent and Tool Ecosystem · OpenAI Five · Dota 2 · Proximal Policy Optimization · +1 more

5 · Hugging Face Blog · 1mo ago

Mini-R1: Reproducing DeepSeek R1 'Aha Moment' — An RL Tutorial

A Hugging Face blog post demonstrates how to reproduce DeepSeek R1's emergent 'aha moment' reasoning behavior using reinforcement learning on a countdown game task. The tutorial walks through training a smaller model with RL to exhibit chain-of-thought self-correction, similar to the behavior observed in DeepSeek R1. This serves as a practical open-source replication effort aimed at demystifying R1's training dynamics.

Frontier Model Releases · Open Weights Progress · DeepSeek V4 · GRPO · Open R1 · +3 more

6 · OpenAI Blog · 1mo ago

OpenAI Five Defeats 99.95th Percentile Dota 2 Players in Live Benchmark Match

OpenAI Five won a best-of-three series against a team of five 99.95th-percentile Dota 2 players, four of whom have played professionally, in a live event watched by approximately 100,000 concurrent viewers. The match was framed as a benchmark result demonstrating the system's capability against near-top-tier human competition. This represents a milestone in the ongoing development of OpenAI's reinforcement learning-based Dota 2 agent.

Evaluation and Benchmarking · Agent and Tool Ecosystem · Merlini · OpenAI Five · Cap · +5 more