alignment tax
other · active · provisional
alignment-tax-98209ad9 · 1 event · first seen 15d ago
Aliases: alignment tax
Recent events (1)
SafeSteer: Localized On-Policy Distillation for Efficient Safety Alignment
SafeSteer proposes a safety alignment method that targets only the 'safety tokens' in the output distribution instead of applying global fine-tuning, on the argument that safety features are inherently sparse. It first constructs a safety teacher via activation steering, then restricts a reverse KL penalty to the selected safety tokens during training. The approach is reported to achieve strong safety performance across seven benchmarks with minimal capability degradation while requiring only 100 harmful samples, less than 1% of the data used by prior baselines.
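To make the "restricted reverse KL" idea concrete, here is a minimal NumPy sketch of a reverse KL penalty, KL(student || teacher), averaged only over selected tokens. It is an illustration under stated assumptions, not the paper's implementation: the summary does not say how safety tokens are chosen, so the sketch assumes they are given as a binary mask over sequence positions, and the names `masked_reverse_kl` and `safety_mask` are hypothetical.

```python
import numpy as np


def log_softmax(logits):
    """Numerically stable log-softmax over the last (vocabulary) axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def masked_reverse_kl(student_logits, teacher_logits, safety_mask):
    """Mean reverse KL, KL(student || teacher), over masked positions only.

    student_logits, teacher_logits: arrays of shape (seq_len, vocab_size).
    safety_mask: array of shape (seq_len,), 1 where a position is treated as
        a safety token and 0 elsewhere. How the mask is built is an
        assumption here; the summary does not specify the selection rule.
    """
    log_p = log_softmax(student_logits)  # student log-probabilities
    log_q = log_softmax(teacher_logits)  # teacher log-probabilities
    # Reverse KL takes the expectation under the student distribution.
    per_position = (np.exp(log_p) * (log_p - log_q)).sum(axis=-1)
    mask = np.asarray(safety_mask, dtype=per_position.dtype)
    selected = mask.sum()
    if selected == 0:
        return 0.0  # no safety tokens selected, so no penalty
    return float((per_position * mask).sum() / selected)
```

Positions outside the mask contribute nothing to the penalty, which is the sense in which the training signal is localized: the student is free to differ from the steered teacher everywhere except on the selected tokens.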