paper

Actionable Activation Directions for Detecting and Mitigating Emergent Misalignment Across Language Model Families

paper · active · provisional


1 event · first seen 2d ago


Co-occurring entities

Llama 3.2 · Gemma 2 · Qwen2.5-1.5B · Ministral 3B · difference-in-means

More like this (12)

Contrastive-Difference CKA Reveals Concept-Specific Structural Alignment Across Language Model Architectures
Exploring Adversarial Robustness and Safety Alignment in Multilingual Multi-Modal Large Language Models
The Value Axis: Language Models Encode Whether They're on the Right Track
The Neutral Mask: How RLHF Provides Shallow Alignment while Leaving Partisan Structure Intact in a Large Language Model
Transformer Language Models
Language Model Safety Monitor
Multi-Faceted Interactivity Alignment in Full-Duplex Speech Models
emergent misalignment
Consistency Training Can Entrench Misalignment
Agentic Environment Engineering for Large Language Models: A Survey of Environment Modeling, Synthesis, Evaluation, and Application
Reasoning Language Models
Acoustic Cue Alignment in Audio Language Models for Speech Emotion Recognition

Recent events (1)

6 · arXiv · cs.CL · 2d ago

Activation-space directions for detecting and mitigating emergent misalignment across LLM families

Researchers fine-tuned small instruction-tuned models from four families (Qwen2.5-1.5B, Gemma-2-2B, Llama-3.2-1B, Ministral-3B) on insecure code to induce emergent misalignment, then investigated whether a shared activation-space direction could detect and correct it. Within each model, a difference-in-means direction achieves 99.6% separation of aligned vs. misaligned activations, and causal steering that subtracts this direction reduces misaligned behavior by 21–51 points. Cross-architecture transfer via ridge regression yields large behavioral suppression but fails specificity controls. The result is a two-tier structure: within-model directions are causally specific and actionable, while cross-model directions are real but non-specific. The findings bound the utility of linear cross-architecture correction, and the authors recommend within-model probing for safety auditing.
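The three linear operations named in the summary (a difference-in-means direction, steering by subtraction, and a ridge-regression map between models) can be illustrated with a minimal NumPy sketch. This is not the paper's code: the synthetic activations, the midpoint-threshold classifier, the steering strength `alpha`, the regularizer `lam`, and the use of paired activations to fit the cross-model map are all assumptions made for illustration.

```python
import numpy as np

def diff_in_means_direction(aligned, misaligned):
    """Unit vector pointing from the aligned mean to the misaligned mean."""
    d = misaligned.mean(axis=0) - aligned.mean(axis=0)
    return d / np.linalg.norm(d)

def separation_accuracy(aligned, misaligned, direction):
    """Accuracy of thresholding projections at the midpoint of the class means.
    (Assumed metric; the summary does not say how separation is measured.)"""
    pa, pm = aligned @ direction, misaligned @ direction
    threshold = 0.5 * (pa.mean() + pm.mean())
    correct = (pa < threshold).sum() + (pm >= threshold).sum()
    return correct / (len(pa) + len(pm))

def steer(acts, direction, alpha):
    """Steering by subtraction: shift every activation by -alpha * direction."""
    return acts - alpha * direction

def fit_ridge_map(src, tgt, lam=1.0):
    """Closed-form ridge solution W minimizing ||src W - tgt||^2 + lam ||W||^2."""
    d = src.shape[1]
    return np.linalg.solve(src.T @ src + lam * np.eye(d), src.T @ tgt)

# Synthetic stand-ins for residual-stream activations (hypothetical data).
rng = np.random.default_rng(0)
dim_a, dim_b, n = 64, 48, 500
true_dir = rng.normal(size=dim_a)
true_dir /= np.linalg.norm(true_dir)
aligned = rng.normal(size=(n, dim_a))
misaligned = rng.normal(size=(n, dim_a)) + 6.0 * true_dir

# Within-model: estimate the direction, measure separation, then steer.
direction = diff_in_means_direction(aligned, misaligned)
acc = separation_accuracy(aligned, misaligned, direction)
steered = steer(misaligned, direction, alpha=6.0)

# Cross-model: pretend model B sees a linear image of model A's activations,
# fit a ridge map on paired activations, and carry the direction across.
mix = rng.normal(size=(dim_a, dim_b))
src = np.vstack([aligned, misaligned])
tgt = src @ mix
W = fit_ridge_map(src, tgt, lam=1.0)
mapped = direction @ W
mapped /= np.linalg.norm(mapped)
native = diff_in_means_direction(aligned @ mix, misaligned @ mix)
cosine = float(mapped @ native)
```

In this toy setup the mapped direction agrees closely with model B's own direction because B is an exact linear image of A. Real model pairs offer no such guarantee, which is consistent with the summary's finding that transferred directions suppress behavior but fail specificity controls.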

Evaluation and Benchmarking · AI Safety Research · Llama 3.2 · Gemma 2 · Qwen2.5-1.5B · +4 more