One-shot imitation learning
OpenAI published research on one-shot imitation learning, a technique enabling agents to learn new tasks from a single demonstration. The approach allows a policy network to observe a demonstration and immediately generalize to new instances of the same task without additional training. This was an early contribution to the field of meta-learning and few-shot generalization in robotics and sequential decision-making.
Related events (8)
Learning Montezuma's Revenge from a Single Demonstration
OpenAI trained a reinforcement learning agent to score 74,500 on Montezuma's Revenge from a single human demonstration, surpassing all previously published results. The method is straightforward: the agent plays episodes that start from carefully selected states in the demonstration, beginning near its end and moving the start point earlier as the agent improves, and it optimizes the game score with PPO. The result shows that a demonstration-seeded curriculum can dramatically improve exploration in hard-exploration environments. The same PPO algorithm underpins OpenAI Five.
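The idea of choosing episode start states from the demonstration can be sketched as a simple curriculum. This is only an illustration under stated assumptions, not OpenAI's code: it assumes the start point begins near the end of the demonstration and moves earlier once the agent succeeds often enough, and the names `next_start_index` and `run_curriculum`, the `threshold` value, and the `step` size are all hypothetical. The PPO update itself is omitted.

```python
def next_start_index(current_index, success_rate, threshold=0.2, step=1):
    """Move the episode start point earlier in the demonstration once the
    agent succeeds often enough from the current start point.

    threshold and step are placeholder values chosen for illustration.
    """
    if success_rate >= threshold:
        return max(0, current_index - step)
    return current_index


def run_curriculum(demo_length, success_rates, threshold=0.2, step=1):
    """Replay a sequence of measured success rates and return the start
    indices visited, beginning at the last state of the demonstration."""
    index = demo_length - 1
    visited = [index]
    for rate in success_rates:
        index = next_start_index(index, rate, threshold, step)
        visited.append(index)
    return visited


# Hypothetical success rates measured after three rounds of training.
schedule = run_curriculum(demo_length=4, success_rates=[0.5, 0.0, 0.9])
print(schedule)
```

Starting close to the end keeps the remaining exploration problem short; each time the start point moves back, the agent only has to discover a small additional stretch of the game.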
Imitation learning technique infers red agent policy in partially observable cyber-defense environments
Researchers propose a Policy Learning Technique using imitation learning to infer attacker (red agent) policies from network observations and defender actions in partially observable autonomous cyber environments. The method integrates with neurosymbolic cyber-defense agents that use behavior trees with learning-enabled components. Evaluated across diverse simulated scenarios, the approach achieves high prediction accuracy for red agent actions, improving the defender's ability to anticipate intrusions.
Generalizing from Simulation: OpenAI Sim-to-Real Robotics Transfer
OpenAI published results on sim-to-real transfer for robot controllers, demonstrating that policies trained entirely in simulation can be deployed on physical robots and respond to unplanned environmental changes. The work represents a shift from open-loop to closed-loop control systems in robotics. This is a 2017 research milestone predating current frontier model work but relevant to the historical trajectory of OpenAI's robotics program.
RL²: Fast Reinforcement Learning via Slow Reinforcement Learning
OpenAI published RL², a meta-reinforcement learning approach in which a slow outer RL process trains a recurrent neural network whose hidden state encodes a fast inner learning algorithm. The method allows agents to rapidly adapt to new tasks within a single episode by leveraging experience accumulated across many training tasks. This work is an early foundational contribution to meta-learning for RL, predating the modern agent and LLM era but relevant to understanding the intellectual lineage of in-context and few-shot learning in AI systems.
On First-Order Meta-Learning Algorithms
OpenAI published research on first-order meta-learning algorithms, presenting simplified variants of MAML (Model-Agnostic Meta-Learning) that omit second-order derivatives while retaining competitive performance. The work demonstrates that first-order approximations are surprisingly effective for few-shot learning tasks. This contributed to the broader understanding of gradient-based meta-learning efficiency and scalability.
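A first-order meta-update of this kind can be illustrated on a toy problem. The sketch below uses a Reptile-style rule (move the initialization toward the task-adapted weights) on a one-dimensional quadratic loss. The toy loss, the function names, and all learning rates are assumptions made for illustration and are not taken from the paper's experiments.

```python
def inner_sgd(w, target, lr=0.1, steps=5):
    """Run a few SGD steps on the toy task loss (w - target) ** 2."""
    for _ in range(steps):
        grad = 2.0 * (w - target)
        w = w - lr * grad
    return w


def reptile_step(w, targets, meta_lr=0.5, lr=0.1, steps=5):
    """Move the initialization toward the average task-adapted weight.

    Only first-order gradients from the inner loop are used; nothing is
    differentiated through the inner optimization.
    """
    adapted = [inner_sgd(w, t, lr, steps) for t in targets]
    mean_adapted = sum(adapted) / len(adapted)
    return w + meta_lr * (mean_adapted - w)


# Two toy tasks whose optima are at 1.0 and 3.0.
w = 0.0
for _ in range(50):
    w = reptile_step(w, targets=[1.0, 3.0])
print(w)
```

On this toy problem the initialization settles midway between the two task optima, a point from which a few inner steps reach either task equally quickly.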
DARP: Semi-parametric retrieval-based imitation learning reduces compounding errors by 15-46%
Researchers introduce DARP (Difference-Aware Retrieval Policies), a semi-parametric imitation learning method that retrieves k-nearest neighbor demonstrations at inference time and predicts actions based on relative distance vectors between neighbor and query states. The approach reparameterizes behavior cloning around local neighborhood structure rather than global state-to-action mappings, requiring no additional data collection or online expert feedback. Across continuous control and robotic manipulation tasks, DARP shows 15-46% performance improvements over standard behavior cloning, including on high-dimensional visual inputs.
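The retrieval step described above can be sketched as follows. This covers only the nearest-neighbor lookup and the construction of difference vectors; the learned model that DARP uses to turn these into an action is not shown. The function names, the Euclidean metric, and the toy data are assumptions for illustration and are not taken from the paper.

```python
import math


def k_nearest(query, demo_states, k):
    """Return indices of the k demonstration states closest to the query,
    using Euclidean distance (an assumed metric)."""
    order = sorted(range(len(demo_states)),
                   key=lambda i: math.dist(query, demo_states[i]))
    return order[:k]


def difference_features(query, demo_states, demo_actions, k=2):
    """Pair each retrieved neighbor's action with the vector pointing from
    the neighbor state to the query state."""
    features = []
    for i in k_nearest(query, demo_states, k):
        diff = [q - s for q, s in zip(query, demo_states[i])]
        features.append((diff, demo_actions[i]))
    return features


# Hypothetical demonstration data: 2-D states with symbolic actions.
demo_states = [[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]]
demo_actions = ["left", "right", "stop"]
features = difference_features([0.9, 0.0], demo_states, demo_actions, k=2)
```

A downstream predictor would consume each `(difference, action)` pair, so it reasons about how the query differs from nearby demonstrated states instead of mapping the raw state directly to an action.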
Learning from Human Preferences: OpenAI and DeepMind Collaborate on Reward Learning from Comparisons
OpenAI, in collaboration with DeepMind's safety team, published a method for learning reward functions directly from human preference comparisons between pairs of agent behaviors, eliminating the need to hand-code goal functions. The algorithm infers human intent by asking evaluators which of two proposed behaviors is preferable, addressing risks from misspecified reward functions. This work is an early foundational contribution to what would become reinforcement learning from human feedback (RLHF). It targets both safety and alignment concerns around reward hacking and proxy gaming.
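Learning a reward from pairwise comparisons is commonly framed as a logistic (Bradley-Terry style) model over the predicted rewards of the two behaviors. The sketch below shows that loss for a single comparison, assuming each behavior has already been reduced to one summed predicted reward. The function names are hypothetical and the reward model itself is omitted.

```python
import math


def preference_probability(reward_a, reward_b):
    """Modeled probability that behavior A is preferred over behavior B,
    given the predicted (summed) reward of each."""
    return 1.0 / (1.0 + math.exp(reward_b - reward_a))


def preference_loss(reward_a, reward_b, human_prefers_a):
    """Cross-entropy between the modeled preference and the human label."""
    p_a = preference_probability(reward_a, reward_b)
    return -math.log(p_a if human_prefers_a else 1.0 - p_a)


# The loss is lower when the reward model agrees with the human label.
agree = preference_loss(2.0, 0.0, human_prefers_a=True)
disagree = preference_loss(2.0, 0.0, human_prefers_a=False)
```

Minimizing this loss over many labeled pairs pushes the reward model to assign higher reward to the behaviors that evaluators prefer, and the policy is then trained against that learned reward.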
Language models are few-shot learners
OpenAI published the GPT-3 paper introducing a 175-billion-parameter autoregressive language model demonstrating strong few-shot learning capabilities across a wide range of NLP tasks. The work showed that scaling language models dramatically improves task-agnostic, few-shot performance, sometimes becoming competitive with prior state-of-the-art fine-tuned models without any gradient updates. This paper became a foundational milestone in the development of large language models and the modern AI landscape.
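Few-shot learning in this setting means the task examples are placed in the model's input text instead of being used for training. A minimal sketch of assembling such a prompt is shown below. The `Input:`/`Output:` layout and the function name are assumptions for illustration, not a format prescribed by the paper, and no model call is made.

```python
def build_few_shot_prompt(examples, query,
                          input_label="Input", output_label="Output"):
    """Concatenate solved examples followed by an unsolved query, so the
    model can infer the task from context alone."""
    lines = []
    for source, target in examples:
        lines.append(f"{input_label}: {source}")
        lines.append(f"{output_label}: {target}")
        lines.append("")  # blank line between examples
    lines.append(f"{input_label}: {query}")
    lines.append(f"{output_label}:")
    return "\n".join(lines)


# Hypothetical English-to-French translation examples.
prompt = build_few_shot_prompt([("cheese", "fromage")], "dog")
print(prompt)
```

The model's weights stay fixed; changing the task only requires changing the examples in the prompt.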


