OpenAI Develops Hierarchical Reinforcement Learning Algorithm for Long-Horizon Tasks
OpenAI published research on a hierarchical reinforcement learning (HRL) algorithm that learns reusable high-level actions to solve tasks requiring thousands of timesteps. Applied to navigation problems, the algorithm discovers locomotion primitives (walking, crawling in various directions) that enable rapid mastery of new tasks. The approach addresses a core challenge in RL: efficient exploration and transfer across long-horizon tasks.
Related events (8)
HABC: Hierarchical Advantage Weighting for Online RL Fine-Tuning of Vision-Language-Action Policies
Researchers introduce Hierarchical Advantage-Weighted Behavior Cloning (HABC), a method for fine-tuning pretrained Vision-Language-Action (VLA) policies via online RL using only sparse binary episode outcomes. HABC trains separate critic heads for viability and efficiency objectives, combines them via a state-adaptive gate, and applies intervention-aware credit assignment to avoid incorrect supervision across human-intervention boundaries. On three contact-rich bimanual real-robot tasks, HABC improves success rates from SFT baselines of 36%, 44%, and 12% to 92%, 88%, and 38% respectively. The work addresses a fundamental credit assignment problem in robot learning from sparse outcome signals.
LongTraceRL: Reinforcement Learning for Long-Context Reasoning via Search Agent Trajectories and Rubric Rewards
LongTraceRL is a new RL training framework for improving long-context reasoning in LLMs, addressing limitations of existing RLVR methods. It constructs challenging training data using multi-hop questions from knowledge graph random walks and tiered distractors derived from search agent trajectories (high-confusability: read but uncited; low-confusability: seen but unopened). A rubric reward provides entity-level process supervision along reasoning chains, applied only to correct responses to prevent reward hacking. Experiments across three LLMs (4B–30B parameters) on five long-context benchmarks show consistent improvements over strong baselines.
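The gating idea in the summary above (rubric credit only for correct responses) can be sketched in a few lines. This is an illustrative sketch, not the LongTraceRL implementation: the function name, the additive form, and the `rubric_weight` value are all assumptions introduced here.

```python
def rubric_gated_reward(is_correct, entities_hit, entities_total, rubric_weight=0.5):
    """Outcome reward plus an entity-level rubric bonus, gated on correctness.

    Hypothetical helper: the additive form and rubric_weight are assumptions,
    not values taken from the LongTraceRL paper.
    """
    outcome = 1.0 if is_correct else 0.0
    # Incorrect answers get no rubric credit, so a response cannot earn
    # reward by naming the right entities while still answering wrongly.
    if not is_correct or entities_total == 0:
        return outcome
    return outcome + rubric_weight * (entities_hit / entities_total)


print(rubric_gated_reward(True, 2, 4))   # correct, half the rubric entities covered
print(rubric_gated_reward(False, 4, 4))  # incorrect, rubric coverage is ignored
```

Gating on correctness is what blocks the reward-hacking path: without it, a policy could collect partial credit for mentioning rubric entities regardless of the final answer.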
Putting RL back in RLHF: RLOO Implementation on Hugging Face
Hugging Face published a blog post introducing RLOO (REINFORCE Leave-One-Out), a reinforcement learning algorithm aimed at making the RL component of RLHF more practical and effective. The post discusses implementation details and motivations for revisiting pure RL-based fine-tuning approaches within the TRL library. This represents a technical contribution to the alignment and RLHF tooling ecosystem, offering an alternative to PPO-based RLHF pipelines.
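The "leave-one-out" in the name refers to the baseline: for each prompt, several completions are sampled, and each completion's reward is compared with the mean reward of the other completions for that prompt. A minimal sketch of that advantage computation follows; it is plain Python for illustration, not the TRL implementation, and the function name is hypothetical.

```python
def rloo_advantages(rewards):
    """Leave-one-out advantages for k sampled completions of one prompt.

    Hypothetical helper for illustration; not part of any library API.
    """
    k = len(rewards)
    if k < 2:
        raise ValueError("leave-one-out needs at least two samples per prompt")
    total = sum(rewards)
    # Baseline for sample i is the mean reward of the other k - 1 samples.
    return [r - (total - r) / (k - 1) for r in rewards]


advantages = rloo_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

Because each baseline excludes the sample it is applied to, it does not depend on that sample's own reward, and no learned value network is needed, which is the main practical difference from PPO.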
RL without TD Learning: Divide-and-Conquer Value Learning for Long-Horizon Off-Policy RL
A BAIR blog post introduces a divide-and-conquer paradigm for off-policy reinforcement learning that avoids the error accumulation of temporal difference (TD) learning by making the number of Bellman recursions grow logarithmically, rather than linearly, with the task horizon. The approach leverages the triangle inequality structure of goal-conditioned RL to define a transitive Bellman update rule, enabling value learning that scales to long-horizon tasks. The authors claim this is the first practical realization of divide-and-conquer value learning at scale in goal-conditioned RL settings, building on an idea traceable to Kaelbling (1993). The post frames this as a third paradigm alongside TD and Monte Carlo methods, addressing a key gap in scalable off-policy RL.
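To see why a transitive update needs only logarithmically many recursions, consider a tabular toy version: distances between states on a chain, updated with the triangle inequality d(s, g) <= d(s, w) + d(w, g) over all midpoints w. This is a sketch under simplifying assumptions (deterministic chain, exact tables, exhaustive minimum over midpoints); it is not the algorithm from the post, which works with learned value functions, and all names below are hypothetical.

```python
INF = float("inf")


def one_step_distances(n):
    """Chain of n states where the only move is i -> i + 1 at cost 1."""
    d = [[INF] * n for _ in range(n)]
    for i in range(n):
        d[i][i] = 0
        if i + 1 < n:
            d[i][i + 1] = 1
    return d


def transitive_update(d):
    """d(s, g) <- min over midpoints w of d(s, w) + d(w, g)."""
    n = len(d)
    return [
        [min(d[s][w] + d[w][g] for w in range(n)) for g in range(n)]
        for s in range(n)
    ]


def solve(n):
    """Apply the transitive update until nothing changes; count the rounds."""
    d = one_step_distances(n)
    rounds = 0
    while True:
        new = transitive_update(d)
        if new == d:
            return d, rounds
        d, rounds = new, rounds + 1


distances, rounds = solve(9)
print(distances[0][8], rounds)  # horizon 8 is resolved in 3 rounds
```

Each round doubles the horizon that the table covers, so a distance of 8 steps is resolved in 3 rounds, whereas a one-step TD-style backup would need 8 sweeps to propagate the same information along the chain.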
AgenticRL: Self-refining LLM-guided reward design and policy refinement for UAV navigation
AgenticRL is a framework that uses a multimodal GPT agent to automate reward function generation, policy training via PPO, and closed-loop self-refinement for UAV navigation tasks. The agent evaluates trained policies through diagnostic feedback, identifies failure modes, and iteratively refines rewards without human intervention. Evaluated across five navigation tasks, the closed-loop refinement improves policy behavior by 71% over initial rewards, with sim-to-real transfer achieving 91% real-world success rate and 94% sim-to-real accuracy.
Dota 2 with Large Scale Deep Reinforcement Learning
OpenAI published a detailed account of the OpenAI Five system that defeated world-champion Dota 2 players using large-scale deep reinforcement learning. The work describes the training infrastructure, self-play curriculum, and scaling properties that enabled superhuman performance in a complex multi-agent environment. This represents a landmark result in applying RL at scale to long-horizon, high-dimensional tasks.
Illustrating Reinforcement Learning from Human Feedback (RLHF)
This Hugging Face blog post provides an illustrated overview of Reinforcement Learning from Human Feedback (RLHF), explaining the technique used to align large language models with human preferences. It covers the core pipeline: pretraining a language model, collecting human preference data, training a reward model, and fine-tuning with RL. Published in December 2022, it served as an accessible reference during the period when RLHF was becoming central to frontier model development.
RL²: Fast Reinforcement Learning via Slow Reinforcement Learning
OpenAI published RL², a meta-reinforcement learning approach in which a slow outer RL process trains a recurrent neural network whose hidden state encodes a fast inner learning algorithm. The method allows agents to rapidly adapt to new tasks within a single episode by leveraging experience accumulated across many training tasks. This work is an early foundational contribution to meta-learning for RL, predating the modern agent and LLM era but relevant to understanding the intellectual lineage of in-context and few-shot learning in AI systems.

