7 · OpenAI Blog · 1mo ago

Scaling Laws for Reward Model Overoptimization

OpenAI published research investigating how reward model overoptimization scales with policy and reward model size in RLHF pipelines. The work characterizes the relationship between KL divergence from the initial policy and gold-standard reward, finding predictable degradation patterns as optimization pressure increases. This provides empirical grounding for understanding Goodhart's Law dynamics in language model fine-tuning and has implications for designing safer, more robust RLHF training regimes.
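The shape of that degradation can be sketched numerically. The paper fits gold reward as a function of d = sqrt(KL) from the initial policy, with one functional form for best-of-n sampling and another for RL. The snippet below is a minimal sketch of those two forms; the coefficient values `ALPHA` and `BETA` are hypothetical and chosen only so the rise-then-fall curve is visible.

```python
import math

def gold_reward_bon(d: float, alpha: float, beta: float) -> float:
    """Best-of-n form: R(d) = d * (alpha - beta * d), where d = sqrt(KL)."""
    return d * (alpha - beta * d)

def gold_reward_rl(d: float, alpha: float, beta: float) -> float:
    """RL form: R(d) = d * (alpha - beta * ln d), where d = sqrt(KL)."""
    return d * (alpha - beta * math.log(d))

def peak_distance_bon(alpha: float, beta: float) -> float:
    """Setting dR/dd = alpha - 2*beta*d to zero gives d* = alpha / (2*beta)."""
    return alpha / (2.0 * beta)

def peak_distance_rl(alpha: float, beta: float) -> float:
    """Setting dR/dd = alpha - beta*ln(d) - beta to zero gives d* = exp(alpha/beta - 1)."""
    return math.exp(alpha / beta - 1.0)

# Hypothetical coefficients, for illustration only (not fitted values).
ALPHA, BETA = 1.0, 0.1

for kl in (1, 4, 16, 64, 256):
    d = math.sqrt(kl)
    print(f"KL={kl:>3}  d={d:5.1f}  best-of-n gold reward={gold_reward_bon(d, ALPHA, BETA):7.2f}")
```

Past the peak distance, further optimization against the proxy reward model lowers the gold reward, which is the overoptimization regime the summary describes.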

Evaluation and Benchmarking, AI Safety Research, Alignment and RLHF, KL Divergence, Goodhart's Law, Scaling Laws for Reward Model Overoptimization, Reinforcement Learning from Human Feedback, OpenAI

Related guides (3)

OpenAI

OpenAI: The Lab That Made AI a Household Word

AI Safety Research · Topic guide

AI Safety Research: From Lab Policies to Real-World Flashpoints

Alignment and RLHF · Topic guide

Alignment and RLHF: Teaching AI Models to Behave

Related events (8)

5 · OpenAI Blog · 1mo ago

Measuring Goodhart's Law

OpenAI published a blog post examining Goodhart's Law in the context of AI training, where optimizing a proxy objective can cause it to diverge from the true underlying goal. The post addresses the challenge of measuring and optimizing objectives that are difficult or costly to evaluate directly. This is directly relevant to reward hacking, specification gaming, and alignment research at OpenAI.
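A small simulation can make the proxy-versus-true gap concrete. This is a toy sketch, not the post's experiment: each candidate has a hidden true score, the proxy is the true score plus Gaussian noise, and the candidate with the highest proxy is selected. The noise level, trial count, and function name `best_of_n_gap` are all assumptions made for illustration.

```python
import random

def best_of_n_gap(n: int, noise_sd: float = 1.0, trials: int = 2000, seed: int = 0):
    """Return (mean proxy score, mean true score) of the proxy-selected candidate."""
    rng = random.Random(seed)
    proxy_sum = 0.0
    true_sum = 0.0
    for _ in range(trials):
        candidates = []
        for _ in range(n):
            true_score = rng.gauss(0.0, 1.0)
            proxy_score = true_score + rng.gauss(0.0, noise_sd)  # noisy proxy
            candidates.append((proxy_score, true_score))
        proxy_score, true_score = max(candidates)  # select on the proxy only
        proxy_sum += proxy_score
        true_sum += true_score
    return proxy_sum / trials, true_sum / trials

for n in (1, 2, 16):
    proxy_mean, true_mean = best_of_n_gap(n)
    print(f"n={n:>2}  proxy={proxy_mean:5.2f}  true={true_mean:5.2f}  gap={proxy_mean - true_mean:5.2f}")
```

In this toy setup the selected candidate's proxy score overstates its true score, and the overstatement grows with n, because stronger selection also selects for favorable noise.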

Evaluation and Benchmarking, Alignment and RLHF, Goodhart's Law, reward hacking, OpenAI

9 · OpenAI Blog · 1mo ago

Scaling Laws for Neural Language Models

OpenAI published foundational research establishing empirical scaling laws for neural language models, showing that model performance scales predictably with compute, data, and parameters. The work demonstrated power-law relationships between these factors and loss, providing a principled framework for allocating training resources. This paper became a cornerstone of modern large language model development strategy.
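A power law of the form loss = c * N^(-alpha) becomes a straight line in log-log space, so its exponent can be estimated with ordinary least squares. The sketch below fits synthetic data; the values `true_alpha` and `true_c` are hypothetical and are not the constants reported in the paper.

```python
import math

def fit_power_law(xs, losses):
    """Least-squares fit of loss = c * x**(-alpha) in log-log space.

    Returns (alpha, c).
    """
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in losses]
    count = len(log_x)
    mean_x = sum(log_x) / count
    mean_y = sum(log_y) / count
    slope = sum((a - mean_x) * (b - mean_y) for a, b in zip(log_x, log_y)) / sum(
        (a - mean_x) ** 2 for a in log_x
    )
    intercept = mean_y - slope * mean_x
    return -slope, math.exp(intercept)

# Synthetic, noise-free data generated from hypothetical constants.
true_alpha, true_c = 0.08, 10.0
param_counts = [1e6, 1e7, 1e8, 1e9]
synthetic_losses = [true_c * p ** (-true_alpha) for p in param_counts]

alpha, c = fit_power_law(param_counts, synthetic_losses)
print(f"fitted alpha={alpha:.4f}, c={c:.3f}")
```

With a fitted exponent in hand, the same formula can be used to extrapolate expected loss at a larger scale, which is the resource-allocation use the summary refers to.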

Training Infrastructure, Frontier Model Releases, Jared Kaplan, Sam McCandlish, OpenAI +3 more

6 · OpenAI Blog · 1mo ago

Improving Model Safety Behavior with Rule-Based Rewards

OpenAI has developed a method called Rule-Based Rewards (RBRs) that trains models to behave safely without requiring extensive human data collection. The approach uses explicit rules to generate reward signals during training, offering a more scalable alternative to traditional RLHF-based safety alignment. This represents a practical contribution to alignment methodology from a Tier 1 lab.
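The general idea of turning explicit rules into a reward signal can be sketched as a weighted sum of rule checks. Everything below is hypothetical: the rule names, the string-matching checks, and the weights are stand-ins chosen for illustration, and the summary does not say how the actual method evaluates or weights its rules.

```python
from typing import Callable, List, Tuple

# (rule name, check function, weight). All entries are illustrative assumptions.
Rule = Tuple[str, Callable[[str], bool], float]

RULES: List[Rule] = [
    ("contains_apology", lambda text: "sorry" in text.lower(), 0.5),
    ("states_refusal", lambda text: "can't help with that" in text.lower(), 1.0),
    ("judgmental_language", lambda text: "you should be ashamed" in text.lower(), -1.0),
]

def rule_based_reward(response: str, rules: List[Rule] = RULES) -> float:
    """Sum the weights of every rule whose check passes on the response."""
    return sum(weight for _, check, weight in rules if check(response))

print(rule_based_reward("Sorry, I can't help with that."))                   # 1.5
print(rule_based_reward("I can't help with that. You should be ashamed."))  # 0.0
```

Because the reward comes from rules rather than labeled comparisons, changing the desired behavior means editing the rule list instead of collecting new human data.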

AI Safety Research, Alignment and RLHF, Reinforcement Learning from Human Feedback, OpenAI, Rule-Based Rewards

6 · Hugging Face Blog · 1mo ago

The N Implementation Details of RLHF with PPO

This Hugging Face blog post catalogs the numerous low-level implementation details that matter when applying Reinforcement Learning from Human Feedback (RLHF) using Proximal Policy Optimization (PPO) for language model fine-tuning. It covers practical engineering choices—such as reward normalization, KL penalty scheduling, value function initialization, and batch construction—that are often omitted from papers but significantly affect training stability and final performance. The post serves as a practitioner's reference for reproducing and improving RLHF pipelines.
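Two of those details, the per-token KL penalty and its scheduling, can be sketched in a few lines. This is a minimal illustration of a commonly used scheme (a per-token penalty against a reference policy, plus a proportional controller that adapts the penalty coefficient toward a target KL); the class name, function name, and default numbers are assumptions, not a specific library's API.

```python
from typing import List

def shaped_rewards(logprobs: List[float], ref_logprobs: List[float],
                   score: float, beta: float) -> List[float]:
    """Per-token reward: -beta * (logp - ref_logp), with the scalar
    reward-model score added on the final token only."""
    rewards = [-beta * (lp - ref) for lp, ref in zip(logprobs, ref_logprobs)]
    rewards[-1] += score
    return rewards

class AdaptiveKLController:
    """Nudges beta up when observed KL exceeds the target, down when below."""

    def __init__(self, init_beta: float, target_kl: float, horizon: int):
        self.beta = init_beta
        self.target_kl = target_kl
        self.horizon = horizon

    def update(self, observed_kl: float, n_steps: int) -> float:
        # Clip the proportional error so one noisy batch cannot swing beta far.
        error = max(-0.2, min(0.2, observed_kl / self.target_kl - 1.0))
        self.beta *= 1.0 + error * n_steps / self.horizon
        return self.beta

controller = AdaptiveKLController(init_beta=0.2, target_kl=6.0, horizon=10000)
print(shaped_rewards([-1.0, -2.0], [-1.5, -2.0], score=3.0, beta=controller.beta))
print(controller.update(observed_kl=12.0, n_steps=256))  # beta rises slightly
```

A fixed beta is the simpler alternative; the adaptive controller trades one hyperparameter (beta) for two (target KL and horizon) in exchange for keeping the policy's drift roughly constant over training.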

Agent and Tool Ecosystem, Alignment and RLHF, KL Divergence, Reinforcement Learning from Human Feedback, Proximal Policy Optimization +2 more

4 · OpenAI Blog · 1mo ago

Faulty Reward Functions in the Wild

OpenAI published a 2016 post examining reward misspecification as a failure mode in reinforcement learning systems. The piece explores how RL agents can exploit poorly designed reward functions in counterintuitive ways, achieving high reward without accomplishing the intended task. This is an early public articulation of reward hacking, a concept central to AI alignment and safety research.
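A toy example shows how such exploitation can arise. The environment below is invented for illustration, loosely in the spirit of a racing game where circling respawning targets pays more than finishing: the designer intends the agent to finish the track, but the reward function also pays a small repeating bonus. All numbers and the function name `episode_return` are hypothetical.

```python
def episode_return(policy: str, horizon: int = 100, track_len: int = 20,
                   bonus_every: int = 3):
    """Return (total reward, finished?) for a fixed policy in a toy race.

    "finish": move forward each step and collect 10 on crossing the line.
    "loop":   never advance; collect a respawning +1 bonus every few steps.
    """
    position = 0
    total = 0.0
    for step in range(1, horizon + 1):
        if policy == "finish":
            position += 1
            if position >= track_len:
                return total + 10.0, True  # episode ends at the finish line
        elif step % bonus_every == 0:
            total += 1.0  # bonus respawns, so looping keeps paying
    return total, False

print(episode_return("finish"))  # (10.0, True)
print(episode_return("loop"))    # (33.0, False)
```

A reward-maximizing agent prefers the looping policy here even though it never completes the intended task, which is the misspecification pattern the summary describes.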

AI Safety Research, Alignment and RLHF, reward misspecification, reward hacking, Reinforcement Learning +1 more

5 · arXiv · cs.LG · 17d ago

Reward uncertainty as a principled mechanism for diverse RL behaviour

A new arXiv preprint proposes replacing the scalar reward in RL with a distribution over reward functions, applying a non-linear objective over sets of actions to induce calibrated behavioural diversity without sacrificing expected reward. The authors derive a principled gradient estimator in the contextual bandit setting and prove the formulation generalizes vanilla policy gradient and action-set approaches. The work is motivated by applications like language model fine-tuning where diversity is desirable but entropy regularization and diversity bonuses introduce fragile trade-offs. Empirical results support the framework as a theoretically grounded alternative to heuristic diversity methods.
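To see why a non-linear objective over action sets can induce diversity when the reward is uncertain, consider a toy case. The objective used here, the expected best reward within the set, is one possible choice made for illustration; the summary does not state the preprint's actual objective, and the two reward hypotheses and their values are invented.

```python
from itertools import combinations_with_replacement

# Two equally likely reward hypotheses over three actions (hypothetical values).
REWARD_HYPOTHESES = [
    (0.5, {"A": 1.0, "B": 0.0, "C": 0.6}),
    (0.5, {"A": 0.0, "B": 1.0, "C": 0.6}),
]

def expected_reward(action: str) -> float:
    """Linear objective: expected reward of a single action."""
    return sum(prob * rewards[action] for prob, rewards in REWARD_HYPOTHESES)

def expected_best_of_set(actions) -> float:
    """Non-linear set objective: expected reward of the best action in the set."""
    return sum(prob * max(rewards[a] for a in actions)
               for prob, rewards in REWARD_HYPOTHESES)

best_single = max("ABC", key=expected_reward)
best_pair = max(combinations_with_replacement("ABC", 2), key=expected_best_of_set)

print(best_single)  # C: the safe compromise wins when only one action is scored
print(best_pair)    # ('A', 'B'): the set objective prefers covering both hypotheses
```

Repeating the single best action (`C`, `C`) scores 0.6 under the set objective, while the diverse pair scores 1.0, so diversity emerges from the objective itself rather than from an added entropy or diversity bonus.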

Evaluation and Benchmarking, Alignment and RLHF, Using Reward Uncertainty to Induce Diverse Behaviour in Reinforcement Learning

5 · Hugging Face Blog · 1mo ago

Illustrating Reinforcement Learning from Human Feedback (RLHF)

This Hugging Face blog post provides an illustrated overview of Reinforcement Learning from Human Feedback (RLHF), explaining the technique used to align large language models with human preferences. It covers the core pipeline: pretraining a language model, collecting human preference data, training a reward model, and fine-tuning with RL. Published in December 2022, it served as an accessible reference during the period when RLHF was becoming central to frontier model development.
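The reward-model step of that pipeline is commonly trained with a pairwise preference loss: given scalar scores for a preferred and a rejected response, minimize -log(sigmoid(score_chosen - score_rejected)). The sketch below computes that loss for single pairs in plain Python; the function name is an assumption, and a real pipeline would compute it over batches of model outputs.

```python
import math

def pairwise_preference_loss(score_chosen: float, score_rejected: float) -> float:
    """-log(sigmoid(score_chosen - score_rejected)), in a numerically stable form.

    Uses softplus(x) = max(x, 0) + log1p(exp(-|x|)) with x = -(chosen - rejected).
    """
    diff = score_chosen - score_rejected
    return max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))

print(pairwise_preference_loss(2.0, 0.0))  # small loss: ranking agrees with the label
print(pairwise_preference_loss(0.0, 0.0))  # log(2): the model is indifferent
print(pairwise_preference_loss(0.0, 2.0))  # large loss: ranking contradicts the label
```

Only the difference between the two scores matters, so the reward model's absolute scale is unconstrained; this is one reason reward normalization shows up as a practical detail in RL fine-tuning.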

Frontier Model Releases, Alignment and RLHF, Reinforcement Learning from Human Feedback, Proximal Policy Optimization, Hugging Face +1 more

6 · arXiv · cs.CL · 2d ago

STARE: Token-level advantage reweighting to prevent entropy collapse in GRPO-style RL training

Researchers introduce STARE, a method addressing policy entropy collapse in GRPO-style reinforcement learning from verifiable rewards (RLVR) for LLM post-training. Through first-order gradient analysis, they identify a token-level credit assignment mismatch and propose selectively reweighting advantages for entropy-critical tokens using batch-internal surprisal quantiles plus a closed-loop entropy gate. Evaluated across 1.5B–32B models on short/long chain-of-thought and multi-turn tool use tasks, STARE outperforms DAPO and other baselines by 4–8% on AIME24/25 while sustaining stable training over thousands of steps.
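For context, the baseline that token-level reweighting modifies can be sketched as follows. In GRPO-style training, each sampled response's reward is normalized against the other responses for the same prompt, and the resulting advantage is typically shared by every token in that response. The snippet shows only this baseline, not STARE itself; the function names are assumptions, and implementations differ on details such as population versus sample standard deviation.

```python
from statistics import mean, pstdev
from typing import List

def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize each response's reward by its group's mean and std deviation."""
    mu = mean(rewards)
    sd = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mu) / (sd + eps) for r in rewards]

def broadcast_to_tokens(advantage: float, num_tokens: int) -> List[float]:
    """Every token in a response receives the same sequence-level advantage."""
    return [advantage] * num_tokens

# Four sampled responses to one prompt with verifiable 0/1 rewards.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
print(broadcast_to_tokens(advantages[0], num_tokens=5))
```

Because all tokens in a response share one advantage, tokens that mattered to the outcome and tokens that did not are updated equally, which is the kind of credit assignment issue a token-level reweighting scheme targets.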

Frontier Model Releases, Alignment and RLHF, DAPO, AIME 2026, GRPO +2 more