Entity · technique

DPPO

technique · active · dppo-cddd23ad · 1 event · first seen Jun 9, 2026

Aliases: DPPO

Co-occurring entities

Divergence Regularized Policy Optimization · GRPO · PPO

More like this (12)

DDPO DPO DPOT DOPD DualDPO SDPO FlowDPO SPPO PPO DDPM DAPO d-OPSD

Recent events (1)

arXiv · cs.LG · Jun 9, 2026

DRPO: Smooth divergence regularization replaces hard masking in LLM RL training

A new arXiv preprint proposes Divergence Regularized Policy Optimization (DRPO), which replaces the hard trust-region mask used in DPPO with a smooth, advantage-weighted quadratic regularizer on policy shift. It targets two known weaknesses: in PPO and GRPO, importance ratios are a poor proxy for distributional shift over long-tailed vocabularies; in DPPO, gradient signals at trust-region boundaries are discarded rather than corrected. The preprint reports improved stability and efficiency in LLM RL post-training across model scales, architectures, and precision settings.
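The contrast between a hard mask and a smooth penalty can be illustrated with a toy per-token sketch. This is not the preprint's formulation: the function names, the `delta` and `beta` parameters, and the exact penalty form `beta * |A| * shift^2` are assumptions chosen only to show the general idea, and `shifts` stands in for whatever per-token policy-shift measure a real method would compute.

```python
# Illustrative only: hypothetical names and formulas, not taken from the DRPO or DPPO papers.

def hard_mask_terms(ratios, advantages, shifts, delta):
    """Hard mask: a token whose policy shift exceeds delta contributes nothing."""
    return [r * a if s <= delta else 0.0
            for r, a, s in zip(ratios, advantages, shifts)]


def smooth_penalty_terms(ratios, advantages, shifts, beta):
    """Smooth alternative: keep every token, subtract beta * |A| * shift^2."""
    return [r * a - beta * abs(a) * s ** 2
            for r, a, s in zip(ratios, advantages, shifts)]


# Two hypothetical tokens: one barely shifted, one well outside the region.
ratios = [1.0, 1.5]
advantages = [1.0, 2.0]
shifts = [0.01, 0.3]

masked = hard_mask_terms(ratios, advantages, shifts, delta=0.1)
smooth = smooth_penalty_terms(ratios, advantages, shifts, beta=1.0)
```

Under these assumed values, the second token is dropped entirely by the mask, while the smooth penalty keeps it with a reduced contribution, so it would still carry a gradient signal.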

Alignment and RLHF · Divergence Regularized Policy Optimization · GRPO · PPO