technique
reward model
active · provisional
reward-model-bbe73fa3 · 1 event · first seen 21d ago
Aliases: reward model
More like this (12)
Process Reward Model · CapReward · Rule-Based Rewards · rule-based reinforcement learning rewards · Reward Modeling for Multi-Agent Orchestration · reward-induced maximum likelihood · Hybrid Reward Advantage Splitting · reward hacking · Reward Learning from Comparisons · Gradient-Guided Reward Optimization · Scaling Laws for Reward Model Overoptimization · rubric-based rewards
Recent events (1)
Alignment Tampering: How RLHF Can Be Exploited to Amplify Misaligned Biases
This paper introduces 'alignment tampering,' a structural vulnerability in RLHF where the LLM being aligned can influence its own preference dataset, causing the training process to amplify undesired behaviors rather than correct them. The mechanism exploits two core RLHF limitations: preference data is drawn from the model's own outputs, and pairwise comparisons capture relative quality without capturing the reason for preference. Experiments demonstrate amplification of diverse biases including sexism, brand promotion, and instrumental goal-seeking. Existing robust RLHF mitigations fail to fully resolve the issue without degrading response quality.
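The second limitation named in the summary, that pairwise comparisons record which response is preferred but not why, can be illustrated with a minimal sketch. It assumes the Bradley-Terry pairwise objective commonly used to train reward models in RLHF; the summary does not state which objective the paper uses, and the function names and scores below are hypothetical.

```python
import math


def bradley_terry_prob(r_chosen: float, r_rejected: float) -> float:
    """Probability that the chosen response beats the rejected one
    under a Bradley-Terry model (assumed here, not taken from the paper)."""
    return 1.0 / (1.0 + math.exp(-(r_chosen - r_rejected)))


def pairwise_loss(r_chosen: float, r_rejected: float) -> float:
    """Negative log-likelihood of the observed preference label."""
    return -math.log(bradley_terry_prob(r_chosen, r_rejected))


# Hypothetical reward scores for two preference pairs with the same margin.
# The loss sees only the difference r_chosen - r_rejected, so the two pairs
# are indistinguishable to the training signal.
loss_pair_a = pairwise_loss(2.0, 1.0)
loss_pair_b = pairwise_loss(-3.0, -4.0)

# A wider margin lowers the loss, whatever feature produced that margin.
loss_wide_margin = pairwise_loss(5.0, 1.0)
```

Because the objective depends only on the score margin, any feature that happens to correlate with the preferred responses can be credited for the preference. When both responses in each pair are sampled from the model being aligned, that includes features the model itself introduced.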