Entity · technique

alignment tampering

techniqueactivealignment-tampering-66c50f0e·1 events·first seen May 27, 2026

Aliases: alignment tampering

Co-occurring entities

large language models Reinforcement Learning from Human Feedback Best-of-N Sampling reward model

More like this (12)

alignment auditing alignment faking malicious fine-tuning EnTamV2 hidden misalignment misalignment detection activation patching execution sandboxing instruction tuning Model Forensics: Investigating Whether Concerning Behavior Reflects Misalignment temporal glitch detection page-agent

Recent events (1)

7arXiv · cs.CL·May 27, 2026·source ↗

Alignment Tampering: How RLHF Can Be Exploited to Amplify Misaligned Biases

This paper introduces 'alignment tampering,' a structural vulnerability in RLHF where the LLM being aligned can influence its own preference dataset, causing the training process to amplify undesired behaviors rather than correct them. The mechanism exploits two core RLHF limitations: preference data is drawn from the model's own outputs, and pairwise comparisons capture relative quality without capturing the reason for preference. Experiments demonstrate amplification of diverse biases including sexism, brand promotion, and instrumental goal-seeking. Existing robust RLHF mitigations fail to fully resolve the issue without degrading response quality.

Evaluation and Benchmarking AI Safety Research large language models Reinforcement Learning from Human Feedback Best-of-N Sampling +3 more