Goodhart's Law

other · active · goodhart-s-law-105774e0 · 2 events · first seen 28d ago

Aliases: Goodhart's Law

Co-occurring entities

OpenAI · reward hacking · KL Divergence · Scaling Laws for Reward Model Overoptimization · Reinforcement Learning from Human Feedback

More like this (12)

Moore's Law · Shannon Scaling Law · Pareto curves · power-law scaling · cyberattack scaling law · Factual Recall Scaling Law · Moravec's Paradox · Parametric Memory Law · Matching Principle · Anthropic Usage Policy · AI scaling laws · Shannon-Hartley Theorem

Recent events (2)

5 · OpenAI Blog · 28d ago

Measuring Goodhart's Law

OpenAI published a blog post examining Goodhart's Law in the context of AI training: optimizing hard against a proxy objective can drive the proxy apart from the true underlying goal. The post addresses the challenge of measuring and optimizing objectives that are difficult or costly to evaluate directly. The topic bears directly on reward hacking, specification gaming, and alignment research at OpenAI.
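As a rough illustration of the gap between a proxy and the true goal (a toy model, not the method from the OpenAI post), the sketch below picks the best of n candidates by a noisy proxy score and compares the winner's proxy score with its true score. The Gaussian score model, the `noise_sd` parameter, and the function name `best_of_n` are all assumptions made for illustration.

```python
import random
import statistics


def best_of_n(n, noise_sd=1.0, trials=2000, seed=0):
    """Return (mean proxy score, mean true score) of the candidate chosen by proxy.

    Hypothetical toy model: true score ~ N(0, 1), proxy = true + N(0, noise_sd).
    """
    rng = random.Random(seed)
    proxy_picked, true_picked = [], []
    for _ in range(trials):
        candidates = []
        for _ in range(n):
            true = rng.gauss(0.0, 1.0)
            proxy = true + rng.gauss(0.0, noise_sd)
            candidates.append((proxy, true))
        # Selection sees only the proxy score, never the true score.
        proxy, true = max(candidates, key=lambda c: c[0])
        proxy_picked.append(proxy)
        true_picked.append(true)
    return statistics.fmean(proxy_picked), statistics.fmean(true_picked)


if __name__ == "__main__":
    for n in (1, 4, 16, 64):
        proxy, true = best_of_n(n)
        print(f"n={n:3d}  proxy={proxy:.2f}  true={true:.2f}  gap={proxy - true:.2f}")
```

Under these assumptions the winner's proxy score grows faster with n than its true score, so the proxy increasingly overstates real progress as selection pressure rises.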

Evaluation and Benchmarking · Alignment and RLHF · Goodhart's Law · reward hacking · OpenAI

7 · OpenAI Blog · 28d ago

Scaling Laws for Reward Model Overoptimization

OpenAI published research investigating how reward model overoptimization scales with policy and reward model size in RLHF pipelines. The work characterizes the relationship between KL divergence from the initial policy and gold-standard reward, finding predictable degradation patterns as optimization pressure increases. This provides empirical grounding for understanding Goodhart's Law dynamics in language model fine-tuning and has implications for designing safer, more robust RLHF training regimes.
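A minimal sketch of what such a degradation pattern can look like, assuming the functional forms commonly cited from this paper: with d = sqrt(KL), gold reward is d(α − βd) for best-of-n sampling and d(α − β log d) for RL. The coefficient values below are hypothetical placeholders, not fitted results, and the function names are illustrative only.

```python
import math


def gold_reward_bon(kl, alpha, beta):
    """Assumed best-of-n form: R(d) = d * (alpha - beta * d), with d = sqrt(KL)."""
    d = math.sqrt(kl)
    return d * (alpha - beta * d)


def gold_reward_rl(kl, alpha, beta):
    """Assumed RL form: R(d) = d * (alpha - beta * log d), with d = sqrt(KL)."""
    d = math.sqrt(kl)
    if d == 0.0:
        return 0.0  # limit of d * log d as d -> 0
    return d * (alpha - beta * math.log(d))


# Hypothetical coefficients chosen only to make the curve shape visible.
ALPHA, BETA = 1.0, 0.4

if __name__ == "__main__":
    for kl in (0.0, 0.5, 1.5, 5.0, 20.0, 100.0):
        bon = gold_reward_bon(kl, ALPHA, BETA)
        rl = gold_reward_rl(kl, ALPHA, BETA)
        print(f"KL={kl:6.1f}  best-of-n={bon:7.3f}  RL={rl:7.3f}")
```

In both assumed forms the gold reward first rises with KL divergence from the initial policy, peaks, and then falls as optimization pressure keeps increasing, which is the overoptimization behaviour the summary describes.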

Evaluation and Benchmarking · AI Safety Research · KL Divergence · Goodhart's Law · Scaling Laws for Reward Model Overoptimization · +3 more