Almanac
← Events
7OpenAI Blog·1mo ago

Scaling Laws for Reward Model Overoptimization

OpenAI published research investigating how reward model overoptimization scales with policy and reward model size in RLHF pipelines. The work characterizes the relationship between KL divergence from the initial policy and gold-standard reward, finding predictable degradation patterns as optimization pressure increases. This provides empirical grounding for understanding Goodhart's Law dynamics in language model fine-tuning and has implications for designing safer, more robust RLHF training regimes.

Related guides (3)

Related events (8)

5Openai Blog·1mo ago·source ↗

Measuring Goodhart's Law

OpenAI published a blog post examining Goodhart's Law in the context of AI training, where optimizing a proxy objective can cause it to diverge from the true underlying goal. The post addresses the challenge of measuring and optimizing objectives that are difficult or costly to evaluate directly. This is directly relevant to reward hacking, specification gaming, and alignment research at OpenAI.

9Openai Blog·1mo ago·source ↗

Scaling Laws for Neural Language Models

OpenAI published foundational research establishing empirical scaling laws for neural language models, showing that model performance scales predictably with compute, data, and parameters. The work demonstrated power-law relationships between these factors and loss, providing a principled framework for allocating training resources. This paper became a cornerstone of modern large language model development strategy.

6Openai Blog·1mo ago·source ↗

Improving Model Safety Behavior with Rule-Based Rewards

OpenAI has developed a method called Rule-Based Rewards (RBRs) that trains models to behave safely without requiring extensive human data collection. The approach uses explicit rules to generate reward signals during training, offering a more scalable alternative to traditional RLHF-based safety alignment. This represents a practical contribution to alignment methodology from a Tier 1 lab.

6Hugging Face Blog·1mo ago·source ↗

The N Implementation Details of RLHF with PPO

This Hugging Face blog post catalogs the numerous low-level implementation details that matter when applying Reinforcement Learning from Human Feedback (RLHF) using Proximal Policy Optimization (PPO) for language model fine-tuning. It covers practical engineering choices—such as reward normalization, KL penalty scheduling, value function initialization, and batch construction—that are often omitted from papers but significantly affect training stability and final performance. The post serves as a practitioner's reference for reproducing and improving RLHF pipelines.

4Openai Blog·1mo ago·source ↗

Faulty Reward Functions in the Wild

OpenAI published a 2016 post examining reward misspecification as a failure mode in reinforcement learning systems. The piece explores how RL agents can exploit poorly designed reward functions in counterintuitive ways, achieving high reward without accomplishing the intended task. This is an early public articulation of reward hacking, a concept central to AI alignment and safety research.

5arXiv · cs.LG·17d ago·source ↗

Reward uncertainty as a principled mechanism for diverse RL behaviour

A new arXiv preprint proposes replacing the scalar reward in RL with a distribution over reward functions, applying a non-linear objective over sets of actions to induce calibrated behavioural diversity without sacrificing expected reward. The authors derive a principled gradient estimator in the contextual bandit setting and prove the formulation generalizes vanilla policy gradient and action-set approaches. The work is motivated by applications like language model fine-tuning where diversity is desirable but entropy regularization and diversity bonuses introduce fragile trade-offs. Empirical results support the framework as a theoretically grounded alternative to heuristic diversity methods.

5Hugging Face Blog·1mo ago·source ↗

Illustrating Reinforcement Learning from Human Feedback (RLHF)

This Hugging Face blog post provides an illustrated overview of Reinforcement Learning from Human Feedback (RLHF), explaining the technique used to align large language models with human preferences. It covers the core pipeline: pretraining a language model, collecting human preference data, training a reward model, and fine-tuning with RL. Published in December 2022, it served as an accessible reference during the period when RLHF was becoming central to frontier model development.

6arXiv · cs.CL·2d ago·source ↗

STARE: Token-level advantage reweighting to prevent entropy collapse in GRPO-style RL training

Researchers introduce STARE, a method addressing policy entropy collapse in GRPO-style reinforcement learning from verifiable rewards (RLVR) for LLM post-training. Through first-order gradient analysis, they identify a token-level credit assignment mismatch and propose selectively reweighting advantages for entropy-critical tokens using batch-internal surprisal quantiles plus a closed-loop entropy gate. Evaluated across 1.5B–32B models on short/long chain-of-thought and multi-turn tool use tasks, STARE outperforms DAPO and other baselines by 4–8% on AIME24/25 while sustaining stable training over thousands of steps.