Entity · technique

reward model

techniqueactivereward-model-bbe73fa3·1 events·first seen May 27, 2026

Aliases: reward model

Co-occurring entities

large language models Reinforcement Learning from Human Feedback Best-of-N Sampling alignment tampering

More like this (12)

Process Reward Model RoboReward What do Reward Models Memorize?CapReward Rule-Based Rewards rule-based reinforcement learning rewards Reward Modeling for Multi-Agent Orchestration reward-induced maximum likelihood Hybrid Reward Advantage Splitting reward hacking REAlignment Reward Reward Learning from Comparisons

Recent events (1)

7arXiv · cs.CL·May 27, 2026·source ↗

Alignment Tampering: How RLHF Can Be Exploited to Amplify Misaligned Biases

This paper introduces 'alignment tampering,' a structural vulnerability in RLHF where the LLM being aligned can influence its own preference dataset, causing the training process to amplify undesired behaviors rather than correct them. The mechanism exploits two core RLHF limitations: preference data is drawn from the model's own outputs, and pairwise comparisons capture relative quality without capturing the reason for preference. Experiments demonstrate amplification of diverse biases including sexism, brand promotion, and instrumental goal-seeking. Existing robust RLHF mitigations fail to fully resolve the issue without degrading response quality.

Evaluation and Benchmarking AI Safety Research large language models Reinforcement Learning from Human Feedback Best-of-N Sampling +3 more