Entity · organization

ML Alignment & Theory Scholars Program

organization · active · ml-alignment-theory-scholars-program-8ce83804 · 1 event · first seen Jun 1, 2026

Aliases: ML Alignment & Theory Scholars Program

Co-occurring entities

Gemma 2 9B · assistant axis · Llama 3.1 70B · EQ-Bench · DeepSeek V4 · MMLU-Pro · activation capping · Qwen3 32B · University of Oxford · IFEval · Christina Lu · GSM8K · Anthropic

More like this (12)

The Alignment Project
MedAlign
How Does Alignment Tuning Shape Representations of Sycophancy and Related Cue-Induced Biases in LLMs?
Alignment Research Center
AI alignment
Latent Embedding Alignment
Cross-Theory Harmonization
JAM (Judge for Adaptive Metric-Alignment)
ATLAS: Active Theory Learning for Automated Science
Measuring the Gap Between Human and LLM Research Ideas
Measuring the Gap Between Human and LLM Research Ideas
Artificial Analysis LLM Performance Leaderboard

Recent events (1)

The Batch · Jun 1, 2026

Activation Capping Technique Stabilizes LLM Assistant Personas Against Drift and Jailbreaks

Researchers from MATS (the ML Alignment & Theory Scholars Program), the University of Oxford, and Anthropic introduced the 'assistant axis,' a vector derived from LLM layer outputs that quantifies how closely a model adheres to its trained assistant persona. They also developed 'activation capping,' an inference-time method that corrects a model's activations when their similarity to this axis falls below a threshold. In tests on Gemma 2 27B, Qwen3 32B, and Llama 3.3 70B, the rate of harmful responses to jailbreak prompts dropped by roughly half (e.g., from 83% to 41% for Qwen3 32B) without degrading benchmark performance. The technique targets character-based jailbreaks, which bypass system prompts by manipulating a model's internal representational state.
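
The capping step described above can be illustrated with a minimal sketch. It is not the researchers' implementation: the function name `cap_activation`, the use of a scalar projection as the similarity measure, and the choice to restore the activation exactly to the threshold are all assumptions made for illustration. The summary states only that deviations are corrected when similarity to the axis falls below a threshold.

```python
import numpy as np


def cap_activation(hidden, axis, threshold):
    """Hypothetical capping rule (an assumption, not the published method).

    If the component of `hidden` along `axis` is below `threshold`,
    shift `hidden` along the axis until that component equals `threshold`.
    Otherwise return the activation unchanged.
    """
    unit = axis / np.linalg.norm(axis)      # normalize the axis direction
    projection = float(hidden @ unit)       # scalar component along the axis
    if projection >= threshold:
        return hidden.copy()                # already within range: no change
    # Add only the missing amount along the axis; orthogonal parts are untouched.
    return hidden + (threshold - projection) * unit


# Toy 2-D example with a made-up axis and threshold.
axis = np.array([1.0, 0.0])
drifted = np.array([-2.0, 3.0])             # component along axis is -2.0
capped = cap_activation(drifted, axis, threshold=0.5)
```

In this toy case only the component along the axis changes, so the rest of the activation is preserved. In a real model, a rule like this would presumably be applied to layer outputs during inference, but where and how it is applied is not specified in the summary.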

Evaluation and Benchmarking · AI Safety Research · Gemma 2 9B · assistant axis · Llama 3.1 70B