3OpenAI Blog·1mo ago

Variance Reduction for Policy Gradient with Action-Dependent Factorized Baselines

OpenAI published a research paper on variance reduction techniques for policy gradient methods in reinforcement learning. The work introduces action-dependent factorized baselines as a way to reduce variance in policy gradient estimates without introducing bias. This is a foundational RL training methodology contribution relevant to improving sample efficiency in reinforcement learning.

Alignment and RLHF Action-Dependent Factorized Baselines Policy Gradient Methods Variance Reduction OpenAI

Related guides (2)

OpenAI

OpenAI: The Lab That Made AI a Household Word

Read asBeginner In-depth

Alignment and RLHFTopic guide

Alignment and RLHF: Teaching AI Models to Behave

Read asBeginner In-depth

Related events (8)

4arXiv · cs.LG·11d ago·source ↗

Agency-transferring technique improves RL policy training by bootstrapping from baseline policies

A new arXiv paper proposes a model-free reinforcement learning method that embeds an existing suboptimal baseline policy into training via an arbitration mechanism, progressively transferring control from the baseline to a trainable neural network. The approach yields high goal-reaching rates from the start of training and produces a standalone policy that outperforms the baseline without requiring it at inference time. Theoretical bounds on goal-reaching probability are derived, and empirical results on continuous-control benchmarks show competitive or superior returns compared to existing methods.

Alignment and RLHF An Agency-Transferring Model-Free Policy Enhancement Technique

4Openai Blog·1mo ago·source ↗

Evolved Policy Gradients: OpenAI Meta-Learning via Loss Function Evolution

OpenAI released Evolved Policy Gradients (EPG), a meta-learning method that evolves the loss function used to train reinforcement learning agents rather than hand-designing it. The approach enables faster adaptation to novel tasks, with agents demonstrating generalization to test-time scenarios outside their training distribution, such as navigating to objects placed in new locations. EPG represents an experimental direction in automated algorithm discovery for RL.

Agent and Tool Ecosystem Alignment and RLHF Evolved Policy Gradients meta-learning Reinforcement Learning +1 more

6arXiv · cs.LG·4d ago·source ↗

HABC: Hierarchical Advantage Weighting for Online RL Fine-Tuning of Vision-Language-Action Policies

Researchers introduce Hierarchical Advantage-Weighted Behavior Cloning (HABC), a method for fine-tuning pretrained Vision-Language-Action (VLA) policies via online RL using only sparse binary episode outcomes. HABC trains separate critic heads for viability and efficiency objectives, combines them via a state-adaptive gate, and applies intervention-aware credit assignment to avoid incorrect supervision across human-intervention boundaries. On three contact-rich bimanual real-robot tasks, HABC improves success rates from SFT baselines of 36%, 44%, and 12% to 92%, 88%, and 38% respectively. The work addresses a fundamental credit assignment problem in robot learning from sparse outcome signals.

Alignment and RLHF Hierarchical Advantage Weighting for Online RL Fine-Tuning of VLAs from Sparse Episode Outcomes Hierarchical Advantage-Weighted Behavior Cloning

5arXiv · cs.LG·17d ago·source ↗

Reward uncertainty as a principled mechanism for diverse RL behaviour

A new arXiv preprint proposes replacing the scalar reward in RL with a distribution over reward functions, applying a non-linear objective over sets of actions to induce calibrated behavioural diversity without sacrificing expected reward. The authors derive a principled gradient estimator in the contextual bandit setting and prove the formulation generalizes vanilla policy gradient and action-set approaches. The work is motivated by applications like language model fine-tuning where diversity is desirable but entropy regularization and diversity bonuses introduce fragile trade-offs. Empirical results support the framework as a theoretically grounded alternative to heuristic diversity methods.

Evaluation and Benchmarking Alignment and RLHF Using Reward Uncertainty to Induce Diverse Behaviour in Reinforcement Learning

5Openai Blog·1mo ago·source ↗

Equivalence between Policy Gradients and Soft Q-Learning

OpenAI published a research result establishing a formal equivalence between policy gradient methods and soft Q-learning, two major families of reinforcement learning algorithms. The work shows that under entropy regularization, these approaches are mathematically equivalent, unifying previously separate lines of RL research. This has implications for algorithm design, theoretical understanding, and the development of hybrid RL methods.

Alignment and RLHF Policy Gradient Methods Entropy Regularization OpenAI +1 more

6Berkeley Ai Research (Bair) Blog·1mo ago·source ↗

RL without TD Learning: Divide-and-Conquer Value Learning for Long-Horizon Off-Policy RL

A BAIR blog post introduces a divide-and-conquer paradigm for off-policy reinforcement learning that avoids temporal difference (TD) learning's error accumulation problem by reducing Bellman recursions logarithmically rather than linearly. The approach leverages the triangle inequality structure of goal-conditioned RL to define a transitive Bellman update rule, enabling value learning that scales to long-horizon tasks. The authors claim this is the first practical realization of divide-and-conquer value learning at scale in goal-conditioned RL settings, building on an idea traceable to Kaelbling (1993). The post frames this as a third paradigm alongside TD and Monte Carlo methods, addressing a key gap in scalable off-policy RL.

Evaluation and Benchmarking Agent and Tool Ecosystem Leslie Pack Kaelbling Divide-and-Conquer Value Learning Berkeley AI Research (BAIR)+8 more

7Openai Blog·1mo ago·source ↗

Scaling Laws for Reward Model Overoptimization

OpenAI published research investigating how reward model overoptimization scales with policy and reward model size in RLHF pipelines. The work characterizes the relationship between KL divergence from the initial policy and gold-standard reward, finding predictable degradation patterns as optimization pressure increases. This provides empirical grounding for understanding Goodhart's Law dynamics in language model fine-tuning and has implications for designing safer, more robust RLHF training regimes.

Evaluation and Benchmarking AI Safety Research KL Divergence Goodhart's Law Scaling Laws for Reward Model Overoptimization +3 more

7arXiv · cs.AI·29d ago·source ↗

Vector Policy Optimization: Training for Diversity Improves Test-Time Search

Vector Policy Optimization (VPO) is a new RL post-training algorithm for LLMs that replaces the scalar reward paradigm with vector-valued rewards, explicitly training models to produce diverse solution sets that specialize across different reward trade-offs. VPO is designed as a near-drop-in replacement for the GRPO advantage estimator and targets inference-scaling search procedures like AlphaEvolve. Across four tasks, VPO matches or outperforms scalar RL baselines on pass@k and best@k metrics, with advantages growing as search budget increases, and unlocks evolutionary search problems that GRPO-trained models cannot solve. The paper argues that diversity-optimized post-training may need to become the default as inference-time search becomes standard.

Evaluation and Benchmarking Inference Economics GRPO pass@k AlphaEvolve +4 more