Entity · technique

Divergence Regularized Policy Optimization

technique · active · divergence-regularized-policy-optimization-f1c641ee · 1 event · first seen Jun 9, 2026

Aliases: Divergence Regularized Policy Optimization

Recent events (1)

5 · arXiv · cs.LG · Jun 9, 2026

DRPO: Smooth divergence regularization replaces hard masking in LLM RL training

A new arXiv preprint proposes Divergence Regularized Policy Optimization (DRPO), which replaces the hard trust-region mask used in DPPO with a smooth, advantage-weighted quadratic regularizer on policy shift. The method targets two known weaknesses: in PPO and GRPO, importance ratios are a poor proxy for distributional shift over long-tailed vocabularies; in DPPO, gradient signals at trust-region boundaries are discarded rather than corrected. Experiments across model scales, architectures, and precision settings show improved stability and efficiency in LLM RL post-training.
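The contrast between a hard mask and a smooth regularizer can be illustrated with a toy per-token sketch. This is not the preprint's objective: the function names, the use of the log-ratio as the measure of policy shift, the `|advantage|` weighting, and the `delta` and `beta` parameters are all assumptions chosen for illustration. It compares the gradient each scheme passes back with respect to the log-ratio.

```python
import math

def hard_mask_grad(log_ratio, advantage, delta):
    """Gradient of ratio * advantage w.r.t. log_ratio under a hard mask.

    Assumption: a token is masked out (zero gradient) once its
    absolute log-ratio exceeds the trust-region radius delta.
    """
    if abs(log_ratio) > delta:
        return 0.0  # signal is discarded outside the trust region
    return math.exp(log_ratio) * advantage

def smooth_reg_grad(log_ratio, advantage, beta):
    """Gradient of ratio * advantage - beta * |advantage| * log_ratio**2.

    Assumption: the quadratic penalty is weighted by |advantage| and
    applied to the log-ratio; beta is a hypothetical strength knob.
    """
    surrogate_grad = math.exp(log_ratio) * advantage
    penalty_grad = 2.0 * beta * abs(advantage) * log_ratio
    return surrogate_grad - penalty_grad

if __name__ == "__main__":
    # Sweep the policy shift for a token with positive advantage.
    for lr in (0.0, 0.1, 0.5, 1.0):
        hard = hard_mask_grad(lr, advantage=1.0, delta=0.2)
        smooth = smooth_reg_grad(lr, advantage=1.0, beta=2.0)
        print(f"log_ratio={lr:.1f}  hard={hard:+.3f}  smooth={smooth:+.3f}")
```

In this toy setting the hard mask returns a zero gradient once the shift exceeds `delta`, whereas the quadratic penalty yields a gradient that grows with the shift and eventually pulls the token back toward the old policy.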

Alignment and RLHF · Divergence Regularized Policy Optimization · GRPO · PPO · +1 more