
Actionable Activation Directions for Detecting and Mitigating Emergent Misalignment Across Language Model Families



Recent events (1)

arXiv · cs.CL · 2d ago

Activation-space directions for detecting and mitigating emergent misalignment across LLM families

Researchers fine-tuned four small instruction-tuned models from different families (Qwen2.5-1.5B, Gemma-2-2B, Llama-3.2-1B, Ministral-3B) on insecure code to induce emergent misalignment, then tested whether a shared activation-space direction could detect and correct it. A difference-in-means direction achieves 99.6% separation of aligned vs. misaligned activations within each model, and causal steering by subtracting this direction reduces misaligned behavior by 21–51 points. Cross-architecture transfer via ridge regression also yields large behavioral suppression but fails specificity controls. This reveals a two-tier structure: within-model directions are causally specific and actionable, while cross-model directions are real but non-specific. The findings bound the utility of linear cross-architecture correction, and the authors recommend within-model probing for safety auditing.
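
The three operations named in the summary (a difference-in-means direction, steering by subtracting it, and a ridge-regression map between models) can be illustrated with a minimal NumPy sketch. This is not the paper's code: the data are synthetic Gaussians, and the function names, the midpoint threshold, the steering scale `alpha`, and the ridge penalty `lam` are all assumptions chosen for illustration.

```python
import numpy as np

def diff_in_means_direction(aligned, misaligned):
    """Unit vector pointing from the aligned mean to the misaligned mean."""
    d = misaligned.mean(axis=0) - aligned.mean(axis=0)
    return d / np.linalg.norm(d)

def separation_accuracy(aligned, misaligned, direction):
    """Accuracy of a midpoint threshold on projections (assumed metric)."""
    pa, pm = aligned @ direction, misaligned @ direction
    threshold = 0.5 * (pa.mean() + pm.mean())
    correct = (pa < threshold).sum() + (pm >= threshold).sum()
    return correct / (len(pa) + len(pm))

def steer(acts, direction, alpha=1.0):
    """Subtract alpha times the direction from every activation vector."""
    return acts - alpha * direction

def ridge_map(source, target, lam=1.0):
    """Closed-form ridge solution W for source @ W ~= target."""
    d = source.shape[1]
    gram = source.T @ source + lam * np.eye(d)
    return np.linalg.solve(gram, source.T @ target)

# Synthetic stand-ins for hidden activations (hypothetical data).
rng = np.random.default_rng(0)
dim = 16
true_dir = np.zeros(dim)
true_dir[0] = 1.0
aligned = rng.normal(size=(200, dim))
misaligned = rng.normal(size=(200, dim)) + 6.0 * true_dir

direction = diff_in_means_direction(aligned, misaligned)
accuracy = separation_accuracy(aligned, misaligned, direction)
steered = steer(misaligned, direction, alpha=6.0)

# Hypothetical paired activations from a second model with a different width.
w_true = rng.normal(size=(dim, 12))
other_model = aligned @ w_true
w_hat = ridge_map(aligned, other_model, lam=1e-8)
transported = direction @ w_hat  # direction expressed in the other model's space
```

With a unit-norm direction, `steer` lowers every projection onto that direction by exactly `alpha`, which is why the scale has to be tuned in practice. The transported vector only shows the mechanics of mapping a direction between activation spaces; whether such a vector is specific to misalignment is the question the summary reports as failing specificity controls.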