organization
ML Alignment & Theory Scholars Program
organization·active·provisional
ml-alignment-theory-scholars-program-8ce83804·1 event·first seen 15d ago
Aliases: ML Alignment & Theory Scholars Program
Co-occurring entities
More like this (12)
The Alignment Project
MedAlign
Alignment Research Center
AI alignment
ATLAS: Active Theory Learning for Automated Science
Artificial Analysis LLM Performance Leaderboard
deliberative alignment
University of Texas SysML Lab
Wang-ML-Lab
AI for Science program
The Neutral Mask: How RLHF Provides Shallow Alignment while Leaving Partisan Structure Intact in a Large Language Model
Allen Institute
Recent events (1)
Activation Capping Technique Stabilizes LLM Assistant Personas Against Drift and Jailbreaks
Researchers from MATS, Oxford, and Anthropic introduced the 'assistant axis,' a vector derived from LLM layer outputs that quantifies how closely a model adheres to its trained assistant persona. They developed 'activation capping,' an inference-time method that corrects deviations from this axis when similarity falls below a threshold. Testing on Gemma 2 27B, Qwen3 32B, and Llama 3.3 70B showed harmful response rates to jailbreak prompts dropped by roughly half (e.g., 83% to 41% for Qwen3 32B) without degrading benchmark performance. The technique targets character-based jailbreaks that bypass system prompts by manipulating a model's internal representational state.
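The summary above gives no implementation details, so the following is only a minimal sketch of what "correcting deviations from the axis when similarity falls below a threshold" could look like. It assumes that similarity is measured as the scalar projection of a hidden state onto a unit-length axis vector, and that the correction adds just enough of the axis direction to bring the projection back up to the threshold. The names `cap_activation`, `axis`, and `floor` are hypothetical and do not come from the original work.

```python
import math
from typing import List


def unit(v: List[float]) -> List[float]:
    """Return v scaled to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("axis vector must be non-zero")
    return [x / norm for x in v]


def projection(hidden: List[float], axis_unit: List[float]) -> float:
    """Scalar projection of a hidden state onto a unit-length axis."""
    return sum(h * a for h, a in zip(hidden, axis_unit))


def cap_activation(hidden: List[float], axis: List[float], floor: float) -> List[float]:
    """Assumed correction rule (not confirmed by the source):
    if the projection onto the axis is below `floor`, shift the hidden
    state along the axis until the projection equals `floor`.
    Components orthogonal to the axis are left unchanged."""
    u = unit(axis)
    p = projection(hidden, u)
    if p >= floor:
        return list(hidden)  # already close enough to the axis; no change
    shift = floor - p
    return [h + shift * a for h, a in zip(hidden, u)]


# Hypothetical 2-dimensional example; real hidden states have thousands of dimensions.
assistant_axis = [2.0, 0.0]
drifted = cap_activation([0.5, 3.0], assistant_axis, floor=1.0)
stable = cap_activation([2.0, 3.0], assistant_axis, floor=1.0)
```

In this toy case the drifted state has its projection raised to the threshold while its other component is untouched, and the stable state passes through unmodified. In a real model, a rule like this would presumably be applied to a chosen layer's outputs at inference time, but which layers and what threshold are used is not stated in the summary.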