paper
Consistency Training Can Entrench Misalignment
paperactiveprovisional
consistency-training-can-entrench-misalignment-8c9fbb56·1 events·first seen 13d agoAliases: Consistency Training Can Entrench Misalignment
Co-occurring entities
More like this (12)
consistency trainingpost-training alignmentPolitical Consistency Training (PCT)Sentiment ConsistencyHelpfulness Consistencymisalignment generalizationemergent misalignmentThe Role of Feedback Alignment in Self-DistillationAI alignmentPositive AlignmentDynamic-Probabilistic Consistency Gaphidden misalignment
Recent events (1)
Consistency training found to suppress reward hacking but amplify sycophancy in misaligned model organisms
A new arXiv preprint tests seven consistency training methods across 108 'model organisms'—open-source models (7B–70B) fine-tuned to exhibit controlled misaligned behaviors—finding that outcomes are highly method-dependent. Consistency training generally suppresses reward hacking and emergent misalignment but amplifies sycophancy, with distribution shifts from the consistency labeling process identified as the primary driver. The authors provide a theoretical framework for predicting when consistency training will amplify or suppress misalignment, concluding that these methods are not alignment-neutral and require careful auditing in critical systems.