Using Reward Uncertainty to Induce Diverse Behaviour in Reinforcement Learning
Recent events (1)
Reward uncertainty as a principled mechanism for diverse RL behaviour
A new arXiv preprint proposes replacing the scalar reward in RL with a distribution over reward functions, applying a non-linear objective over sets of actions to induce calibrated behavioural diversity without sacrificing expected reward. The authors derive a principled gradient estimator in the contextual bandit setting and prove the formulation generalizes vanilla policy gradient and action-set approaches. The work is motivated by applications like language model fine-tuning where diversity is desirable but entropy regularization and diversity bonuses introduce fragile trade-offs. Empirical results support the framework as a theoretically grounded alternative to heuristic diversity methods.
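To make the idea concrete, here is a minimal sketch of one possible set-based objective in a single-context bandit. It is not the preprint's formulation: the choice of "best action in a set of k i.i.d. draws" as the non-linear objective, the finite list of equally likely reward functions, the softmax policy, and the names `softmax` and `set_objective_and_grad` are all assumptions made for illustration. The gradient is the standard score-function (REINFORCE) form, computed exactly by enumeration so it only suits tiny action spaces.

```python
import itertools
import numpy as np


def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()


def set_objective_and_grad(logits, reward_samples, k):
    """Exact value and logit-gradient of an ASSUMED set objective:

        J = E_{r ~ P} E_{a_1..a_k ~ pi} [ max_i r(a_i) ]

    reward_samples: array of shape (M, A); each row is one reward function,
    treated as equally likely (an illustrative stand-in for a reward
    distribution). k: number of actions drawn i.i.d. from the policy.
    """
    pi = softmax(logits)
    n_actions = len(pi)
    value = 0.0
    grad = np.zeros(n_actions)
    for actions in itertools.product(range(n_actions), repeat=k):
        idx = list(actions)
        prob = np.prod(pi[idx])
        # Utility of the set: best action per reward function, then averaged.
        utility = reward_samples[:, idx].max(axis=1).mean()
        # Score function for a softmax policy:
        # sum_i grad log pi(a_i) = sum_i (onehot(a_i) - pi).
        score = -k * pi
        for a in actions:
            score[a] += 1.0
        value += prob * utility
        grad += prob * utility * score
    return value, grad


# Hypothetical rewards: two plausible reward functions that disagree about
# actions 0 and 1, while action 2 is a safe middle option. Every action has
# the same mean reward (0.5).
rewards = np.array([[1.0, 0.0, 0.5],
                    [0.0, 1.0, 0.5]])

spread = np.array([5.0, 5.0, -5.0])      # policy split between actions 0 and 1
collapsed = np.array([-5.0, -5.0, 5.0])  # policy concentrated on action 2

for name, logits in [("spread", spread), ("collapsed", collapsed)]:
    value, _ = set_objective_and_grad(logits, rewards, k=2)
    print(name, round(value, 3))
```

Under these assumed rewards, a single-action objective (k = 1) cannot distinguish the three actions, since each has mean reward 0.5. With k = 2, the policy that splits its mass between actions 0 and 1 scores close to 0.75, while the policy that collapses onto action 2 stays close to 0.5, so the set objective favours diversity at no cost in per-action expected reward. With k = 1 and a single reward function, the same code reduces to the ordinary softmax policy gradient, which mirrors (in this toy form only) the generalization claim in the summary.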