Entity · paper

The Neutral Mask: How RLHF Provides Shallow Alignment while Leaving Partisan Structure Intact in a Large Language Model

paper · active


1 event · first seen Jun 9, 2026


Co-occurring entities

Reinforcement Learning from Human Feedback · Sparse Autoencoder · Meta Llama-3.1-8B

More like this (12)

Random Language Model
Contrastive-Difference CKA Reveals Concept-Specific Structural Alignment Across Language Model Architectures
Artificial Epanorthosis: Why large language models overuse a classical rhetorical figure, and how to mitigate it
Recursive Language Models (RLMs)
Actionable Activation Directions for Detecting and Mitigating Emergent Misalignment Across Language Model Families
Exploring Adversarial Robustness and Safety Alignment in Multilingual Multi-Modal Large Language Models
Apparent Psychological Profiles of Large Language Models are Largely a Measurement Artifact
Large-Language-Models-as-a-Judge in Theory-Agnostic Adaptive Metric-Alignment for Prototypical Networks in Personality Recognition
Accelerating Masked Diffusion Large Language Models: A Survey of Efficient Inference Techniques
The Shibboleth Effect: Auditing the Cross-Lingual Distributional Skew of Large Language Models
Does Reasoning Preserve Alignment? On the Trustworthiness of Large Reasoning Models
Understanding Large Language Models

Recent events (1)

arXiv · cs.CL · Jun 9, 2026

RLHF produces shallow political neutrality by severing causal pathways, not erasing partisan structure

Researchers compare internal representations of Llama 3.1 8B before and after RLHF, finding that alignment training does not remove partisan political geometry from the model but instead compresses output variance to produce balanced responses. Sparse autoencoder decomposition shows that policy-encoding features active in the base model become completely inactive in the instruction-tuned version, while feature-level steering experiments confirm the causal disconnect is real. The underlying partisan structure remains intact and can be reactivated by inferring and amplifying a user's partisan identity, suggesting RLHF alignment is functionally fragile. The authors argue this 'disconnection rather than removal' pattern may generalize to other value domains beyond political orientation.
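As a rough illustration of the "active in the base model, inactive after tuning" comparison described above, the following minimal sketch encodes two sets of activations with the same sparse autoencoder and lists the features that fire on the base set but never on the tuned set. It is not the authors' code: the encoder weights, the toy activations, the `min_base_rate` threshold, and every function name here are hypothetical, and a standard ReLU encoder is assumed because the summary does not specify the SAE variant or the layer analyzed.

```python
from typing import List, Sequence

Vector = Sequence[float]


def sae_encode(x: Vector, w_enc: Sequence[Vector], b_enc: Vector) -> List[float]:
    """Assumed ReLU encoder: one non-negative activation per dictionary feature."""
    return [
        max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(w_enc, b_enc)
    ]


def firing_rate(
    acts: Sequence[Vector], w_enc: Sequence[Vector], b_enc: Vector, eps: float = 1e-6
) -> List[float]:
    """Fraction of inputs on which each feature's activation exceeds eps."""
    counts = [0] * len(w_enc)
    for x in acts:
        for j, a in enumerate(sae_encode(x, w_enc, b_enc)):
            if a > eps:
                counts[j] += 1
    return [c / len(acts) for c in counts]


def silenced_features(
    base_acts: Sequence[Vector],
    tuned_acts: Sequence[Vector],
    w_enc: Sequence[Vector],
    b_enc: Vector,
    min_base_rate: float = 0.5,  # hypothetical threshold, not from the paper
) -> List[int]:
    """Indices of features that fire often on base inputs and never on tuned inputs."""
    base = firing_rate(base_acts, w_enc, b_enc)
    tuned = firing_rate(tuned_acts, w_enc, b_enc)
    return [
        j
        for j, (b, t) in enumerate(zip(base, tuned))
        if b >= min_base_rate and t == 0.0
    ]


# Toy data (entirely synthetic): 2-dimensional activations, 3 dictionary features.
W_ENC = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
B_ENC = [0.0, 0.0, 0.0]
BASE_ACTS = [[0.9, 0.8], [0.7, 0.6], [0.8, 0.0]]
TUNED_ACTS = [[0.9, 0.0], [0.7, 0.0], [0.8, 0.0]]  # second direction collapsed

print(silenced_features(BASE_ACTS, TUNED_ACTS, W_ENC, B_ENC))  # -> [1]
```

In this toy setup, feature 1 fires on two of three base inputs and on none of the tuned inputs, so it is reported as silenced; feature 0 stays active in both sets and feature 2 never fires at all. A real analysis would read activations from the two model checkpoints and use a trained SAE rather than hand-written weights.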

AI Safety Research · Alignment and RLHF · Reinforcement Learning from Human Feedback · The Neutral Mask: How RLHF Provides Shallow Alignment while Leaving Partisan Structure Intact in a Large Language Model · Sparse Autoencoder · +2 more