The Neutral Mask: How RLHF Provides Shallow Alignment while Leaving Partisan Structure Intact in a Large Language Model
RLHF produces shallow political neutrality by severing causal pathways, not erasing partisan structure
Researchers compare internal representations of Llama 3.1 8B before and after RLHF, finding that alignment training does not remove partisan political geometry from the model but instead compresses output variance to produce balanced responses. Sparse autoencoder decomposition shows that policy-encoding features active in the base model become completely inactive in the instruction-tuned version, while feature-level steering experiments confirm the causal disconnect is real. The underlying partisan structure remains intact and can be reactivated by inferring and amplifying a user's partisan identity, suggesting RLHF alignment is functionally fragile. The authors argue this 'disconnection rather than removal' pattern may generalize to other value domains beyond political orientation.
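The base-versus-tuned feature comparison described above can be illustrated with a toy sketch. This is not the authors' code: the encoder weights, the activation vectors, and every function name here are hypothetical, and a real analysis would use residual-stream activations from both Llama 3.1 8B checkpoints together with a trained sparse autoencoder. The sketch only shows the bookkeeping: encode activations into features, measure how often each feature fires in each model, and list the features that fire in the base model but never in the tuned one.

```python
from typing import List, Sequence

Vector = List[float]


def sae_encode(x: Sequence[float],
               w_enc: Sequence[Sequence[float]],
               b_enc: Sequence[float]) -> Vector:
    """ReLU(W_enc @ x + b_enc): one feature activation per encoder row."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(w_enc, b_enc)]


def firing_rate(acts: Sequence[Sequence[float]],
                w_enc: Sequence[Sequence[float]],
                b_enc: Sequence[float],
                eps: float = 1e-6) -> Vector:
    """Fraction of inputs on which each feature is active (above eps)."""
    counts = [0] * len(w_enc)
    for x in acts:
        for j, f in enumerate(sae_encode(x, w_enc, b_enc)):
            if f > eps:
                counts[j] += 1
    return [c / len(acts) for c in counts]


def silenced_features(base_acts, tuned_acts, w_enc, b_enc) -> List[int]:
    """Indices of features that fire in the base model but never in the tuned one."""
    base = firing_rate(base_acts, w_enc, b_enc)
    tuned = firing_rate(tuned_acts, w_enc, b_enc)
    return [j for j, (b, t) in enumerate(zip(base, tuned)) if b > 0.0 and t == 0.0]


def steer(x: Sequence[float], direction: Sequence[float], alpha: float) -> Vector:
    """Add alpha times a feature direction to an activation vector."""
    return [xi + alpha * di for xi, di in zip(x, direction)]


# Hypothetical toy data: a 2-feature encoder and three activations per model.
W_ENC = [[1.0, 0.0], [0.0, 1.0]]
B_ENC = [0.0, 0.0]
BASE_ACTS = [[0.5, 0.8], [0.2, 0.0], [0.9, 0.4]]
TUNED_ACTS = [[0.5, -0.1], [0.2, -0.3], [0.9, 0.0]]

SILENCED = silenced_features(BASE_ACTS, TUNED_ACTS, W_ENC, B_ENC)
# Push the first tuned activation along the silenced feature's direction.
STEERED = steer(TUNED_ACTS[0], [0.0, 1.0], 1.0)
```

In this toy setup the second feature is the silenced one, and steering along its direction makes it active again in the encoded representation. That shows only the mechanics of feature-level steering; whether a steered feature changes the model's actual outputs is the empirical question the summarized experiments address.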