Learning Montezuma's Revenge from a Single Demonstration
OpenAI trained a reinforcement learning agent to achieve a score of 74,500 on Montezuma's Revenge using a single human demonstration, surpassing all previously published results. The method is straightforward: the agent plays episodes starting from carefully selected states drawn from the demonstration, optimizing game score via PPO. This approach demonstrates that imitation-seeded curriculum learning can dramatically improve exploration in hard-exploration environments. The same PPO algorithm underpins OpenAI Five.
Related guides (4)
Related events (8)
Dota 2 with Large Scale Deep Reinforcement Learning
OpenAI published a detailed account of the OpenAI Five system that defeated world-champion Dota 2 players using large-scale deep reinforcement learning. The work describes the training infrastructure, self-play curriculum, and scaling properties that enabled superhuman performance in a complex multi-agent environment. This represents a landmark result in applying RL at scale to long-horizon, high-dimensional tasks.
More on Dota 2: OpenAI Self-Play Reaches Superhuman Performance
OpenAI reports that a self-play reinforcement learning system progressed from below high-ranked human level to beating top professional Dota 2 players within one month, using only 1v1 mid-lane play. The post highlights self-play as a mechanism that automatically improves training data quality as the agent improves, contrasting it with supervised learning's dependence on fixed datasets. The result is presented as evidence that sufficient compute combined with self-play can rapidly close and exceed human-level performance gaps.
Learning from Human Preferences: OpenAI and DeepMind Collaborate on Reward Learning from Comparisons
OpenAI, in collaboration with DeepMind's safety team, published a method for learning reward functions directly from human preference comparisons between pairs of agent behaviors, eliminating the need to hand-code goal functions. The algorithm infers human intent by asking evaluators which of two proposed behaviors is preferable, addressing risks from misspecified reward functions. This work is an early foundational contribution to what would become reinforcement learning from human feedback (RLHF). It targets both safety and alignment concerns around reward hacking and proxy gaming.
Evolution Strategies as a Scalable Alternative to Reinforcement Learning
OpenAI published research showing that evolution strategies (ES), a decades-old optimization technique, can match standard reinforcement learning performance on benchmarks like Atari and MuJoCo. The approach offers practical advantages over RL including easier parallelization and fewer hyperparameter sensitivities. This positions ES as a viable alternative training paradigm for policy optimization tasks.
Reinforcement Learning with Prediction-Based Rewards (Random Network Distillation)
OpenAI introduces Random Network Distillation (RND), a curiosity-driven exploration method for reinforcement learning that uses prediction error on a fixed random neural network as an intrinsic reward signal. RND is the first method to exceed average human performance on Montezuma's Revenge, a notoriously hard-exploration Atari game. The approach is simple to implement and compatible with standard RL algorithms, offering a scalable alternative to count-based or dynamics-model exploration bonuses.
OpenAI Five Defeats Amateur Human Teams at Dota 2
OpenAI announced that OpenAI Five, a team of five neural networks trained via self-play, has begun defeating amateur human teams at Dota 2. This represented an early milestone in applying reinforcement learning to complex, long-horizon multi-agent environments. The system was trained using large-scale distributed RL, demonstrating that neural networks could coordinate in real-time strategy games without hand-crafted rules.
Mini-R1: Reproducing DeepSeek R1 'Aha Moment' — An RL Tutorial
A Hugging Face blog post demonstrates how to reproduce DeepSeek R1's emergent 'aha moment' reasoning behavior using reinforcement learning on a countdown game task. The tutorial walks through training a smaller model with RL to exhibit chain-of-thought self-correction, similar to the behavior observed in DeepSeek R1. This serves as a practical open-source replication effort aimed at demystifying R1's training dynamics.
OpenAI Five Defeats 99.95th Percentile Dota 2 Players in Live Benchmark Match
OpenAI Five won a best-of-three series against a team of five high-ranked Dota 2 players, four of whom are professional players, in a live event watched by approximately 100,000 concurrent viewers. The match was framed as a benchmark result demonstrating the system's capability against near-top-tier human competition. This represents a milestone in the ongoing development of OpenAI's reinforcement learning-based Dota 2 agent.



