KL Divergence
kl-divergence-cf6299f5·3 events·first seen 28d agoAliases: KL Divergence, KL Divergence Penalty
Merged from
KL Divergence Penalty
Co-occurring entities
More like this (12)
Recent events (3)
Scaling Laws for Reward Model Overoptimization
OpenAI published research investigating how reward model overoptimization scales with policy and reward model size in RLHF pipelines. The work characterizes the relationship between KL divergence from the initial policy and gold-standard reward, finding predictable degradation patterns as optimization pressure increases. This provides empirical grounding for understanding Goodhart's Law dynamics in language model fine-tuning and has implications for designing safer, more robust RLHF training regimes.
The N Implementation Details of RLHF with PPO
This Hugging Face blog post catalogs the numerous low-level implementation details that matter when applying Reinforcement Learning from Human Feedback (RLHF) using Proximal Policy Optimization (PPO) for language model fine-tuning. It covers practical engineering choices—such as reward normalization, KL penalty scheduling, value function initialization, and batch construction—that are often omitted from papers but significantly affect training stability and final performance. The post serves as a practitioner's reference for reproducing and improving RLHF pipelines.
Perturbation Theory for Spherical Hellinger-Kantorovich Flows with Differential Privacy Guarantees
This paper develops a perturbation theory for Spherical Hellinger-Kantorovich (SHK) gradient flows, which couple transport and reaction dynamics and coincide with birth-death Langevin dynamics. The authors derive dimension-free bounds on log-likelihood ratios and Rényi/KL divergences when two potentials differ, quantifying how perturbations propagate over time. These results are applied to differential privacy: the likelihood-ratio control yields explicit Pure-DP guarantees for SHK-based samplers implementing the exponential mechanism, while KL bounds provide Approximate-DP certificates. A utility bound is also derived that separates intrinsic exponential-mechanism suboptimality from finite-time sampling error.