Entity · technique

activation capping

technique · active · activation-capping-42e9e777 · 1 event · first seen Jun 1, 2026

Aliases: activation capping

Co-occurring entities

Gemma 2 9B · assistant axis · Llama 3.1 70B · EQ-Bench · DeepSeek V4 · ML Alignment & Theory Scholars Program · MMLU-Pro · Qwen3 32B · University of Oxford · IFEval · Christina Lu · GSM8K · Anthropic

More like this (12)

Cap confidence gating CapCode activation patching CAPTCHA agentic coding key-value (KV) activation projection ACE page-agent GPU power capping description-aware gating CADE

Recent events (1)

The Batch · Jun 1, 2026

Activation Capping Technique Stabilizes LLM Assistant Personas Against Drift and Jailbreaks

Researchers from MATS, Oxford, and Anthropic introduced the 'assistant axis,' a vector derived from LLM layer outputs that quantifies how closely a model adheres to its trained assistant persona. They also developed 'activation capping,' an inference-time method that corrects deviations from this axis whenever similarity to it falls below a threshold. In tests on Gemma 2 27B, Qwen3 32B, and Llama 3.3 70B, the rate of harmful responses to jailbreak prompts fell by roughly half (e.g., from 83% to 41% for Qwen3 32B) without degrading benchmark performance. The technique targets character-based jailbreaks, which bypass system prompts by manipulating a model's internal representational state.

Evaluation and Benchmarking · AI Safety Research · Gemma 2 9B · assistant axis · Llama 3.1 70B
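
The summary above describes activation capping only at a high level: measure how well a layer output aligns with the assistant axis, and correct it when that measure falls below a threshold. The following is a minimal sketch of one way such a correction could look, not the researchers' implementation. It assumes that "similarity" is the scalar projection of the hidden state onto a unit-normalized axis and that the correction raises that projection exactly to the threshold; the function names `cap_activation`, `normalize`, and `dot`, and the plain-list representation of vectors, are all hypothetical.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors given as lists of floats."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Return v scaled to unit length; reject the zero vector."""
    norm = math.sqrt(dot(v, v))
    if norm == 0.0:
        raise ValueError("axis must be non-zero")
    return [x / norm for x in v]


def cap_activation(hidden, axis, threshold):
    """Raise the projection of `hidden` onto `axis` to at least `threshold`.

    Assumption (not stated in the source): similarity is the scalar
    projection onto the unit axis, and the fix is to add just enough
    of the axis direction to reach the threshold.
    """
    unit = normalize(axis)
    proj = dot(hidden, unit)
    if proj >= threshold:
        # Already close enough to the assistant axis: leave unchanged.
        return list(hidden)
    shift = threshold - proj
    # Move only along the axis; components orthogonal to it are untouched.
    return [h + shift * u for h, u in zip(hidden, unit)]


# Hypothetical 2-dimensional example.
drifted = cap_activation([0.0, 3.0], [2.0, 0.0], 1.0)   # projection 0.0 < 1.0
on_axis = cap_activation([2.0, 3.0], [2.0, 0.0], 1.0)   # projection 2.0 >= 1.0
print(drifted, on_axis)
```

In this sketch the correction is one-sided: states whose projection already meets the threshold pass through untouched, which is one plausible reading of how such a method could avoid degrading ordinary benchmark behavior. In a real model the same operation would presumably run on layer outputs during inference, with the axis and threshold chosen per layer; those details are not given in the summary.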