Entity · technique

KL-Cov regularization

technique · active · kl-cov-regularization-30336f5d · 1 event · first seen May 22, 2026

Aliases: KL-Cov regularization

Co-occurring entities

token-wise self-certainty · cluster voting reward · RLIF (Reinforcement Learning from Internal Feedback) · Reinforcement Learning with Verifiable Rewards · GDPO

More like this (12)

KL-regularized RL · L0 regularization · Cross-sample Consistency Regularization · Target Distribution Regularization · KL Divergence · Entropy Regularization · R-Drop consistency regularization · reverse KL divergence · C²R: Cross-sample Consistency Regularization Mitigates Feature Splitting and Absorption in Sparse Autoencoders · Entropy-Regularized Reinforcement Learning · ELBO variance minimization · Beyond the Hard Budget: Sparsity Regularizers for More Interpretable Top-k Sparse Autoencoders

Recent events (1)

6 · arXiv · cs.CL · May 22, 2026

Two is better than one: A Collapse-free Multi-Reward RLIF Training Framework

This paper proposes a multi-reward reinforcement learning from internal feedback (RLIF) framework that decomposes the training signal into an answer-level reward from cluster voting and a completion-level reward from token-wise self-certainty. To address the reward hacking and entropy collapse common in single-reward RLIF, the authors introduce GDPO-based normalization and KL-Cov regularization, which targets low-entropy token distributions. On mathematical reasoning and code-generation benchmarks, the method trains stably and approaches the performance of supervised RLVR methods without requiring external ground-truth supervision. The work targets scalable unsupervised RL training for LLM reasoning.
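The summary above does not spell out how KL-Cov regularization is computed, so the following is only a minimal sketch of the general idea as it is commonly described: score each token by the covariance between its log-probability and its advantage, then apply a KL penalty only to the highest-scoring fraction of tokens. Everything here is an assumption for illustration, not the paper's implementation: the function name `kl_cov_penalty`, the parameters `top_frac` and `coef`, the squared log-ratio KL estimate, and the choice to average over all tokens.

```python
import math


def kl_cov_penalty(logp_new, logp_old, advantages, top_frac=0.2, coef=1.0):
    """Hypothetical KL-Cov style penalty over one flat batch of tokens.

    logp_new / logp_old: per-token log-probabilities under the current
    and reference (old) policy. advantages: per-token advantages.
    Returns (selected_indices, penalty).
    """
    n = len(logp_new)
    if n == 0 or not (n == len(logp_old) == len(advantages)):
        raise ValueError("inputs must be non-empty and of equal length")

    mean_logp = sum(logp_new) / n
    mean_adv = sum(advantages) / n

    # Per-token covariance contribution between log-prob and advantage.
    cov = [
        (lp - mean_logp) * (adv - mean_adv)
        for lp, adv in zip(logp_new, advantages)
    ]

    # Keep the top fraction of tokens by covariance (at least one token).
    k = max(1, math.ceil(top_frac * n))
    ranked = sorted(range(n), key=lambda i: cov[i], reverse=True)
    selected = sorted(ranked[:k])

    # Assumed KL estimate: half the squared log-ratio, selected tokens only.
    kl_terms = [0.5 * (logp_new[i] - logp_old[i]) ** 2 for i in selected]

    # Averaged over all n tokens, so unselected tokens contribute zero.
    penalty = coef * sum(kl_terms) / n
    return selected, penalty


if __name__ == "__main__":
    selected, penalty = kl_cov_penalty(
        logp_new=[-0.1, -2.0, -0.05, -1.5],
        logp_old=[-0.3, -2.0, -0.25, -1.5],
        advantages=[1.0, -1.0, 1.0, -1.0],
        top_frac=0.5,
    )
    print(selected, penalty)
```

In a training loop, a term like `penalty` would be added to the policy loss, so that only the tokens driving the covariance are pulled back toward the reference policy while the rest are updated freely.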

AI Safety Research · Alignment and RLHF · KL-Cov regularization · token-wise self-certainty · cluster voting reward