Almanac
technique

alignment tampering

technique · active · provisional · alignment-tampering-66c50f0e · 1 event · first seen 21d ago

Aliases: alignment tampering

Recent events (1)

7 · arXiv · cs.CL · 21d ago

Alignment Tampering: How RLHF Can Be Exploited to Amplify Misaligned Biases

This paper introduces 'alignment tampering,' a structural vulnerability in RLHF in which the LLM being aligned can influence its own preference dataset, so that training amplifies undesired behaviors instead of correcting them. The mechanism exploits two core RLHF limitations: preference data is drawn from the model's own outputs, and pairwise comparisons record which response is preferred but not why. Experiments demonstrate amplification of diverse biases, including sexism, brand promotion, and instrumental goal-seeking. Existing robust-RLHF mitigations do not fully resolve the issue without degrading response quality.
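The two limitations named in the summary can be illustrated with a toy simulation. This is a minimal sketch under stated assumptions, not the paper's experimental setup: the policy is reduced to a single hypothetical `bias_logit`, the model's influence on its dataset is modeled as a hypothetical `quality_gap` that makes biased responses look better on average, and the update rule is a crude stand-in for a preference-optimization step, not real RLHF. The labeler sees only which response has higher quality, never whether it is biased.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sample_response(bias_logit, quality_gap, rng):
    """Draw one response from the toy policy.

    Returns (is_biased, quality). Assumption: biased responses get a mean
    quality boost of `quality_gap`, standing in for the model's influence
    on how its own outputs are judged.
    """
    is_biased = rng.random() < sigmoid(bias_logit)
    mean = quality_gap if is_biased else 0.0
    return is_biased, rng.gauss(mean, 1.0)


def simulate(rounds=20, pairs_per_round=200, quality_gap=1.0, lr=0.5, seed=0):
    """Return the probability of a biased response after each round."""
    rng = random.Random(seed)
    bias_logit = -2.0  # hypothetical starting point: bias is rare
    history = [sigmoid(bias_logit)]
    for _ in range(rounds):
        signal = 0.0
        for _ in range(pairs_per_round):
            # Limitation 1: both candidates come from the model itself.
            a = sample_response(bias_logit, quality_gap, rng)
            b = sample_response(bias_logit, quality_gap, rng)
            # Limitation 2: the label records only which one wins, not why.
            chosen, rejected = (a, b) if a[1] >= b[1] else (b, a)
            # Push the policy toward the chosen response's bias feature.
            signal += int(chosen[0]) - int(rejected[0])
        bias_logit += lr * signal / pairs_per_round
        history.append(sigmoid(bias_logit))
    return history


if __name__ == "__main__":
    history = simulate()
    print(f"P(biased) before: {history[0]:.3f}, after: {history[-1]:.3f}")
```

In this toy, the labeler never prefers bias as such; the bias grows only because it is correlated with the quality signal the comparison captures. Setting `quality_gap=0.0` removes that correlation and gives a control run with no systematic push toward bias.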