technique
Divergence Regularized Policy Optimization
technique · active · provisional
divergence-regularized-policy-optimization-f1c641ee · 1 event · first seen 8d ago
Aliases: Divergence Regularized Policy Optimization
More like this (12)
Denoising Diffusion Policy Optimization
Proximal Policy Optimization
Entropy-Regularized Reinforcement Learning
Hierarchical Relative Policy Optimization
Pareto Optimal Policy Optimization
Vector Policy Optimization
Preference Coordinated Multi-agent Policy Optimization
APPO: Agentic Procedural Policy Optimization
Evolved Policy Gradients
Kolmogorov Regression for Robust Diffusion Policies
GRPO (Group Relative Policy Optimization)
KL-regularized RL
Recent events (1)
DRPO: Smooth divergence regularization replaces hard masking in LLM RL training
A new arXiv preprint proposes Divergence Regularized Policy Optimization (DRPO), which replaces the hard trust-region mask used in DPPO with a smooth, advantage-weighted quadratic regularizer on policy shift. It targets two known weaknesses: in PPO and GRPO, importance ratios are a poor proxy for distributional shift over long-tailed vocabularies; in DPPO, gradient signal at the trust-region boundary is discarded rather than corrected. The authors report improved stability and efficiency in LLM RL post-training across model scales, architectures, and precision settings.
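To make the contrast in this summary concrete, here is a minimal per-token sketch in Python. It is not the preprint's objective. The shift measure (absolute change in token probability), the threshold `delta`, the coefficient `beta`, and all three function names are assumptions chosen for illustration, since the summary does not give the exact formulas.

```python
def token_terms(p_old, p_new):
    """Return (importance ratio, absolute probability shift) for one token.

    The absolute probability difference is an assumed stand-in for
    "policy shift"; the preprint may use a different divergence.
    """
    return p_new / p_old, abs(p_new - p_old)


def hard_mask_term(ratio, shift, advantage, delta):
    """Hard trust region (illustrative): a token whose shift exceeds
    delta contributes nothing, so its gradient signal is discarded."""
    return ratio * advantage if shift <= delta else 0.0


def smooth_term(ratio, shift, advantage, beta):
    """Smooth alternative (illustrative): keep every token, but subtract
    a quadratic penalty on the shift, weighted by |advantage|."""
    return ratio * advantage - beta * abs(advantage) * shift ** 2


if __name__ == "__main__":
    # A rare token: the ratio is large although the probability mass
    # that actually moved is tiny.
    print(token_terms(1e-6, 1e-5))
    # A common token: the ratio is modest although far more mass moved.
    print(token_terms(0.5, 0.6))

    ratio, shift = token_terms(0.5, 0.6)
    # Just outside the assumed trust region: the mask drops the token,
    # while the smooth term keeps a penalized contribution.
    print(hard_mask_term(ratio, shift, advantage=1.0, delta=0.05))
    print(smooth_term(ratio, shift, advantage=1.0, beta=10.0))
```

The first two calls illustrate why a ratio can misjudge shift over a long-tailed vocabulary: the rare token has a ratio of about 10 with a probability change under 1e-5, while the common token has a ratio of about 1.2 with a change of about 0.1. The last two calls show the masking behavior the summary describes: the masked term is zero once the token leaves the region, whereas the penalized term stays nonzero and still varies with the shift.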