6 · Berkeley AI Research (BAIR) Blog · 1mo ago

RL without TD Learning: Divide-and-Conquer Value Learning for Long-Horizon Off-Policy RL

A BAIR blog post introduces a divide-and-conquer paradigm for off-policy reinforcement learning. It targets the error accumulation of temporal difference (TD) learning by making the number of Bellman recursions grow logarithmically, rather than linearly, with the horizon. The approach uses the triangle-inequality structure of goal-conditioned RL to define a transitive Bellman update rule, so that value learning scales to long-horizon tasks. The authors claim this is the first practical realization of divide-and-conquer value learning at scale in goal-conditioned RL settings, building on an idea traceable to Kaelbling (1993). The post frames the method as a third paradigm alongside TD and Monte Carlo methods, addressing a key gap in scalable off-policy RL.
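The logarithmic-versus-linear claim can be illustrated with a small tabular sketch. This is a toy under stated assumptions, not the authors' method: it assumes a deterministic directed chain, defines the goal-conditioned value as `gamma ** distance`, and applies an exact Floyd-Warshall-style update `V(s, g) <- max_w V(s, w) * V(w, g)` over all midpoints `w`. The function names (`init_values`, `transitive_backup`, `one_step_backup`, `rounds_to_converge`) are hypothetical and chosen for illustration only.

```python
import numpy as np


def init_values(adj, gamma):
    """V[s, g] = 1 if s == g, gamma if there is an edge s -> g, else 0."""
    V = np.where(adj, gamma, 0.0)
    np.fill_diagonal(V, 1.0)
    return V


def transitive_backup(V):
    """Divide-and-conquer update: V(s, g) <- max_w V(s, w) * V(w, g)."""
    products = V[:, :, None] * V[None, :, :]  # indexed [s, w, g]
    return np.maximum(V, products.max(axis=1))


def one_step_backup(V, adj, gamma):
    """TD-style update: V(s, g) <- gamma * max over successors s' of V(s', g)."""
    masked = np.where(adj[:, :, None], V[None, :, :], 0.0)  # indexed [s, s', g]
    return np.maximum(V, gamma * masked.max(axis=1))


def rounds_to_converge(backup, V):
    """Apply `backup` until the table stops changing; return (rounds, table)."""
    rounds = 0
    while True:
        V_next = backup(V)
        if np.allclose(V_next, V):
            return rounds, V
        V, rounds = V_next, rounds + 1


# Toy environment (an assumption): a directed chain 0 -> 1 -> ... -> 16.
n, gamma = 17, 0.9
adj = np.zeros((n, n), dtype=bool)
adj[np.arange(n - 1), np.arange(1, n)] = True

V0 = init_values(adj, gamma)
dc_rounds, V_dc = rounds_to_converge(transitive_backup, V0)
td_rounds, V_td = rounds_to_converge(lambda V: one_step_backup(V, adj, gamma), V0)

print("transitive backup rounds:", dc_rounds)
print("one-step backup rounds:", td_rounds)
```

On this 17-state chain, whose longest path has 16 steps, the transitive backup stops changing after 4 rounds because each round doubles the horizon it covers, while the one-step backup needs 15 rounds. Both reach the same table. With learned, approximate values every round of bootstrapping can add error, which is why fewer rounds matter for long horizons.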

Evaluation and Benchmarking · Agent and Tool Ecosystem · Alignment and RLHF · Leslie Pack Kaelbling · Divide-and-Conquer Value Learning · Berkeley AI Research (BAIR) · GRPO · PPO · Aditya (co-lead author) · Temporal Difference Learning · Floyd-Warshall Algorithm · Goal-Conditioned Reinforcement Learning · Q-learning

Related guides (4)

PPO · Concept

PPO: The Reinforcement Learning Algorithm That Taught AI to Learn from Feedback

GRPO · Concept

GRPO: The Lightweight RL Trick Behind Today's Reasoning Models

Agent and Tool Ecosystem · Topic guide

Agent and Tool Ecosystem: How the Infrastructure Layer Around LLMs Is Consolidating

Alignment and RLHF · Topic guide

Alignment and RLHF: From Human Feedback to Scalable Post-Training

Related events (8)

6 · arXiv · cs.LG · 4d ago

HABC: Hierarchical Advantage Weighting for Online RL Fine-Tuning of Vision-Language-Action Policies

Researchers introduce Hierarchical Advantage-Weighted Behavior Cloning (HABC), a method for fine-tuning pretrained Vision-Language-Action (VLA) policies via online RL using only sparse binary episode outcomes. HABC trains separate critic heads for viability and efficiency objectives, combines them via a state-adaptive gate, and applies intervention-aware credit assignment to avoid incorrect supervision across human-intervention boundaries. On three contact-rich bimanual real-robot tasks, HABC improves success rates from SFT baselines of 36%, 44%, and 12% to 92%, 88%, and 38% respectively. The work addresses a fundamental credit assignment problem in robot learning from sparse outcome signals.

Alignment and RLHF · Hierarchical Advantage Weighting for Online RL Fine-Tuning of VLAs from Sparse Episode Outcomes · Hierarchical Advantage-Weighted Behavior Cloning

6 · arXiv · cs.CL · 29d ago

Two is better than one: A Collapse-free Multi-Reward RLIF Training Framework

This paper proposes a multi-reward reinforcement learning from internal feedback (RLIF) framework that decomposes training signals into an answer-level reward via cluster voting and a completion-level reward via token-wise self-certainty. To address reward hacking and entropy collapse common in single-reward RLIF, the authors introduce GDPO-based normalization and KL-Cov regularization targeting low-entropy token distributions. Evaluated on mathematical reasoning and code-generation benchmarks, the method achieves stability and performance approaching supervised RLVR methods without requiring external ground-truth supervision. The work advances scalable unsupervised RL training for LLM reasoning.

AI Safety Research · Alignment and RLHF · KL-Cov regularization · token-wise self-certainty · cluster voting reward

6 · arXiv · cs.CL · 1mo ago

DelTA: Discriminative Token Credit Assignment for RLVR Training

DelTA introduces a discriminative token credit assignment method for reinforcement learning from verifiable rewards (RLVR) that addresses the problem of high-frequency formatting tokens dominating policy gradient updates. The method estimates per-token coefficients to amplify side-specific gradient directions and downweight shared or weakly discriminative ones, making the effective update direction more contrastive. On seven mathematical benchmarks, DelTA outperforms same-scale baselines by 3.26 and 2.62 average points on Qwen3-8B-Base and Qwen3-14B-Base respectively, with additional gains on code generation tasks.

Frontier Model Releases · Evaluation and Benchmarking · DelTA · Qwen3-8B-Base · policy gradient

4 · OpenAI Blog · 1mo ago

OpenAI Develops Hierarchical Reinforcement Learning Algorithm for Long-Horizon Tasks

OpenAI published research on a hierarchical reinforcement learning (HRL) algorithm that learns reusable high-level actions to solve tasks requiring thousands of timesteps. Applied to navigation problems, the algorithm discovers locomotion primitives (walking, crawling in various directions) that enable rapid mastery of new tasks. The approach addresses a core challenge in RL: efficient exploration and transfer across long-horizon tasks.

Agent and Tool Ecosystem · OpenAI · Hierarchical Reinforcement Learning

6 · arXiv · cs.AI · 19d ago

ReuseRL: Skill Reuse as Compression in Agentic RL via MDL Principle

ReuseRL formalizes agentic reinforcement learning through the Minimum Description Length (MDL) principle, extracting a shared skill dictionary from successful trajectories and augmenting the RL objective with a segmentation cost that penalizes idiosyncratic, non-reusable behaviors. The authors prove a PAC-Bayes generalization bound for this compression penalty. Evaluated on ALFWorld, TextWorld-Cooking, and Countdown-Stepwise, ReuseRL outperforms vanilla GRPO and round-length baselines on both in-distribution and out-of-distribution tasks.

Evaluation and Benchmarking · Agent and Tool Ecosystem · Minimum Description Length · ALFWorld · Countdown-Stepwise

6 · arXiv · cs.AI · 16d ago

DistIL: Distributional DAgger for RL from Rich Feedback beyond single-bit rewards

A new arXiv preprint introduces DistIL, a distributional variant of the DAgger imitation learning algorithm designed to exploit rich feedback signals (execution traces, tool outputs, expert corrections) rather than the single-bit correctness reward used in standard RLVR. The method uses a forward cross-entropy objective that, according to the authors, provides monotonic policy improvement guarantees, unlike the reverse KL or Jensen-Shannon divergence objectives used in prior self-distillation approaches. Empirically, DistIL outperforms RLVR and self-distillation baselines on scientific reasoning, coding, and hard math benchmarks.

Frontier Model Releases · Alignment and RLHF · DAgger · DistIL · Reinforcement Learning with Verifiable Rewards

6 · arXiv · cs.CL · 19d ago

LongTraceRL: Reinforcement Learning for Long-Context Reasoning via Search Agent Trajectories and Rubric Rewards

LongTraceRL is a new RL training framework for improving long-context reasoning in LLMs, addressing limitations of existing RLVR methods. It constructs challenging training data using multi-hop questions from knowledge graph random walks and tiered distractors derived from search agent trajectories (high-confusability: read but uncited; low-confusability: seen but unopened). A rubric reward provides entity-level process supervision along reasoning chains, applied only to correct responses to prevent reward hacking. Experiments across three LLMs (4B–30B parameters) on five long-context benchmarks show consistent improvements over strong baselines.

Long Context Evolution · Evaluation and Benchmarking · tiered distractors · Knowledge Graph Random Walk · Long-context Reasoning Benchmarks

4 · arXiv · cs.LG · 11d ago

Agency-transferring technique improves RL policy training by bootstrapping from baseline policies

A new arXiv paper proposes a model-free reinforcement learning method that embeds an existing suboptimal baseline policy into training via an arbitration mechanism, progressively transferring control from the baseline to a trainable neural network. The approach yields high goal-reaching rates from the start of training and produces a standalone policy that outperforms the baseline without requiring it at inference time. Theoretical bounds on goal-reaching probability are derived, and empirical results on continuous-control benchmarks show competitive or superior returns compared to existing methods.

Alignment and RLHF · An Agency-Transferring Model-Free Policy Enhancement Technique