5OpenAI Blog·1mo ago

RL²: Fast Reinforcement Learning via Slow Reinforcement Learning

OpenAI published RL², a meta-reinforcement learning approach in which a slow outer RL process trains a recurrent neural network whose hidden state encodes a fast inner learning algorithm. The method allows agents to rapidly adapt to new tasks within a single episode by leveraging experience accumulated across many training tasks. This work is an early foundational contribution to meta-learning for RL, predating the modern agent and LLM era but relevant to understanding the intellectual lineage of in-context and few-shot learning in AI systems.

Agent and Tool Ecosystem Alignment and RLHF Recurrent Neural Network Reinforcement Learning OpenAI RL²

Related guides (3)

OpenAI

OpenAI: The Lab That Made AI a Household Word

Read asBeginner In-depth

Agent and Tool EcosystemTopic guide

Agent and Tool Ecosystem: How AI Is Learning to Act, Not Just Answer

Read asBeginner In-depth

Alignment and RLHFTopic guide

Alignment and RLHF: Teaching AI Models to Behave

Read asBeginner In-depth

Related events (8)

4Hugging Face Blog·1mo ago·source ↗

PipelineRL: ServiceNow's Pipeline-Based Reinforcement Learning Framework for LLMs

ServiceNow introduces PipelineRL, a reinforcement learning training framework for large language models published via the Hugging Face blog. The post describes a pipeline-based approach to RL training, likely addressing throughput and efficiency challenges in RLHF or similar post-training workflows. As a tier-2 source with minimal body content, the technical depth is unclear but the topic is relevant to alignment and training infrastructure.

Training Infrastructure Agent and Tool Ecosystem ServiceNow AI PipelineRL Hugging Face +1 more

5Hugging Face Blog·1mo ago·source ↗

Putting RL back in RLHF: RLOO Implementation on Hugging Face

Hugging Face published a blog post introducing RLOO (REINFORCE Leave-One-Out), a reinforcement learning algorithm aimed at making the RL component of RLHF more practical and effective. The post discusses implementation details and motivations for revisiting pure RL-based fine-tuning approaches within the TRL library. This represents a technical contribution to the alignment and RLHF tooling ecosystem, offering an alternative to PPO-based RLHF pipelines.

Agent and Tool Ecosystem Alignment and RLHF RLOO Reinforcement Learning from Human Feedback PPO +2 more

4Openai Blog·1mo ago·source ↗

OpenAI Develops Hierarchical Reinforcement Learning Algorithm for Long-Horizon Tasks

OpenAI published research on a hierarchical reinforcement learning (HRL) algorithm that learns reusable high-level actions to solve tasks requiring thousands of timesteps. Applied to navigation problems, the algorithm discovers locomotion primitives (walking, crawling in various directions) that enable rapid mastery of new tasks. The approach addresses a core challenge in RL: efficient exploration and transfer across long-horizon tasks.

Agent and Tool Ecosystem OpenAI Hierarchical Reinforcement Learning

3Openai Blog·1mo ago·source ↗

Some considerations on learning to explore via meta-reinforcement learning

OpenAI published a research post examining exploration strategies learned through meta-reinforcement learning. The work investigates how agents can acquire exploration behaviors through meta-learning rather than having them hand-designed. This is an early OpenAI contribution to the intersection of meta-learning and RL, predating the current frontier model era.

Alignment and RLHF Reinforcement Learning OpenAI

6Berkeley Ai Research (Bair) Blog·1mo ago·source ↗

RL without TD Learning: Divide-and-Conquer Value Learning for Long-Horizon Off-Policy RL

A BAIR blog post introduces a divide-and-conquer paradigm for off-policy reinforcement learning that avoids temporal difference (TD) learning's error accumulation problem by reducing Bellman recursions logarithmically rather than linearly. The approach leverages the triangle inequality structure of goal-conditioned RL to define a transitive Bellman update rule, enabling value learning that scales to long-horizon tasks. The authors claim this is the first practical realization of divide-and-conquer value learning at scale in goal-conditioned RL settings, building on an idea traceable to Kaelbling (1993). The post frames this as a third paradigm alongside TD and Monte Carlo methods, addressing a key gap in scalable off-policy RL.

Evaluation and Benchmarking Agent and Tool Ecosystem Leslie Pack Kaelbling Divide-and-Conquer Value Learning Berkeley AI Research (BAIR)+8 more

6arXiv · cs.LG·4d ago·source ↗

ExpRL: RL-based mid-training using human QA data as reward scaffolds for LLM reasoning

ExpRL proposes an automated approach to LLM mid-training that replaces manually curated reasoning traces with large corpora of human-written QA data used as reward scaffolds rather than imitation targets. Reference solutions are hidden from the policy and used only to construct problem-specific grading rubrics, enabling dense process-level rewards that reinforce partial progress and intermediate reasoning steps. On challenging math reasoning benchmarks, ExpRL outperforms SFT, sparse-reward GRPO, and self-distillation as an RL initialization strategy, with additional mixed-domain experiments suggesting broader applicability.

Evaluation and Benchmarking Alignment and RLHF ExpRL: Exploratory RL for LLM Mid-Training GRPO ExpRL

5arXiv · cs.CL·5d ago·source ↗

RePro: Retrospective Progress-Aware Self-Refinement for LLM Agent Training

Researchers introduce RePro (Retrospective Progress-Aware Training), a framework addressing the gap between step-wise RL optimization and metacognitive task-progress awareness in LLM agents. The approach uses a forward-then-reflect rollout paradigm where agents execute actions online and then retrospectively assess step-wise progress given the completed trajectory and known outcome. Evaluated on WebShop, ALFWorld, and Sokoban, RePro achieves up to 12% absolute success rate gains over baseline Qwen-family models without requiring continuous external supervision.

Agent and Tool Ecosystem Alignment and RLHF ALFWorld Sokoban RePro +2 more

6arXiv · cs.CL·19d ago·source ↗

LongTraceRL: Reinforcement Learning for Long-Context Reasoning via Search Agent Trajectories and Rubric Rewards

LongTraceRL is a new RL training framework for improving long-context reasoning in LLMs, addressing limitations of existing RLVR methods. It constructs challenging training data using multi-hop questions from knowledge graph random walks and tiered distractors derived from search agent trajectories (high-confusability: read but uncited; low-confusability: seen but unopened). A rubric reward provides entity-level process supervision along reasoning chains, applied only to correct responses to prevent reward hacking. Experiments across three LLMs (4B–30B parameters) on five long-context benchmarks show consistent improvements over strong baselines.

Long Context Evolution Evaluation and Benchmarking tiered distractors Knowledge Graph Random Walk Long-context Reasoning Benchmarks +8 more