Entity · technique

Activation Steering

technique · active · activation-steering-b887fd1d · 2 events · first seen May 28, 2026

Aliases: Activation Steering

Co-occurring entities

on-policy distillation · SafeSteer · alignment tax · reverse KL divergence · Safety Detection Classifier · HHH (Helpful, Harmless, Honest) · AUROC · Synthetic Data Generator

More like this (12)

State-Conditioned Dynamic Steering · Agentic Chain-of-Thought Steering · SafeSteer · representation-level steering · Visual Verification Enables Inference-time Steering and Autonomous Policy Improvement · AI control · Steerable Model Merging · steering vectors · AI alignment · Stability AI · Agentic Chain-of-Thought Steering for Efficient and Controllable LLM Reasoning · AI Control Roadmap

Recent events (2)

6 · arXiv · cs.AI · Jun 2, 2026

SafeSteer: Localized On-Policy Distillation for Efficient Safety Alignment

SafeSteer proposes a safety alignment method that targets only 'safety tokens' in the output distribution rather than applying global fine-tuning, arguing that safety features are inherently sparse. It constructs a safety teacher via activation steering, then restricts a reverse KL penalty to selected safety tokens during training. The approach achieves strong safety performance across seven benchmarks with minimal capability degradation while requiring only 100 harmful samples, less than 1% of the data used by prior baselines.
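The two mechanisms named in this summary can be sketched in a few lines. The block below is a minimal illustration, not the SafeSteer implementation: `steer` shows the common additive form of activation steering (a hidden state shifted along a direction, scaled by a strength), and `masked_reverse_kl` averages KL(student || teacher) over only the positions flagged in a mask. The function names, the `safety_mask` input, and the toy logits are all assumptions made for illustration; how SafeSteer actually selects its safety tokens is not described in this summary.

```python
import math


def steer(hidden, direction, strength):
    """Additive activation steering: shift a hidden state along a direction.

    `hidden` and `direction` are equal-length lists of floats (hypothetical
    stand-ins for one layer's activation and a steering vector).
    """
    return [h + strength * d for h, d in zip(hidden, direction)]


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized list."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def reverse_kl(student_logits, teacher_logits):
    """KL(student || teacher) at a single token position."""
    log_p = log_softmax(student_logits)
    log_q = log_softmax(teacher_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def masked_reverse_kl(student_seq, teacher_seq, safety_mask):
    """Mean reverse KL over positions where `safety_mask` is True.

    Positions outside the mask contribute nothing, so the student is only
    pulled toward the teacher on the selected tokens.
    """
    selected = [
        reverse_kl(s, t)
        for s, t, keep in zip(student_seq, teacher_seq, safety_mask)
        if keep
    ]
    return sum(selected) / len(selected) if selected else 0.0


# Toy example: two token positions, three-word vocabulary (made-up numbers).
student = [[1.0, 2.0, 3.0], [0.0, 0.0, 0.0]]
teacher = [[1.0, 2.0, 3.0], [5.0, 0.0, 0.0]]

loss_on_agreeing_token = masked_reverse_kl(student, teacher, [True, False])
loss_on_differing_token = masked_reverse_kl(student, teacher, [False, True])
```

In this toy setup the penalty is zero when the mask selects only the position where student and teacher agree, and positive when it selects the position where they differ. A real training loop would compute the same quantity on model logits with an autodiff framework.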

Evaluation and Benchmarking · AI Safety Research · on-policy distillation · SafeSteer · alignment tax

6 · arXiv · cs.CL · May 28, 2026

Activation Steering for Synthetic Safety Data Generation: Diversity as a Critical Quality Axis

This paper investigates whether activation steering (AS) can generate high-quality synthetic training data for downstream safety detection classifiers, filling a gap in the literature. Across 4 safety concepts × 2 models × 4 steering methods, the authors find that AS-generated data outperforms prompt-generated data on 3 of 4 concepts, but only 41 of 136 configurations succeed, indicating a narrow effective regime. The study introduces sample- and set-level diversity as a previously absent quality axis, finding that higher steering strength reduces diversity and that the harmonic mean of success, coherence, and diversity correlates more reliably with downstream AUROC than prior metrics alone. The results provide a practical heuristic for practitioners tuning AS hyperparameters for safety data generation.
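As a rough illustration of the composite score described above, the sketch below takes the harmonic mean of success, coherence, and diversity for a steering configuration. It assumes each score is already normalized to the range [0, 1]; the paper's exact definitions and normalization are not given in this summary, and the function name, configuration names, and numbers are hypothetical rather than results from the study.

```python
def composite_quality(success, coherence, diversity):
    """Harmonic mean of three scores, each assumed to lie in [0, 1].

    The harmonic mean is dominated by the smallest component, so a
    configuration cannot compensate for collapsed diversity with high
    success alone. Any non-positive score yields 0.0.
    """
    scores = (success, coherence, diversity)
    if any(s <= 0.0 for s in scores):
        return 0.0
    return len(scores) / sum(1.0 / s for s in scores)


# Hypothetical configurations (made-up scores, for illustration only).
configs = {
    "high_strength": {"success": 0.9, "coherence": 0.9, "diversity": 0.1},
    "moderate_strength": {"success": 0.7, "coherence": 0.7, "diversity": 0.7},
}

# Rank configurations by the composite score, best first.
ranked = sorted(
    configs,
    key=lambda name: composite_quality(**configs[name]),
    reverse=True,
)
best = ranked[0]
```

With these made-up numbers the moderate-strength configuration ranks first, because low diversity drags the high-strength configuration's harmonic mean down even though its success and coherence are higher.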

Evaluation and Benchmarking · AI Safety Research · Safety Detection Classifier · HHH (Helpful, Harmless, Honest) · Activation Steering