Almanac
other

alignment tax

other · active · provisional · alignment-tax-98209ad9 · 1 event · first seen 15d ago

Aliases: alignment tax

Recent events (1)

6 · arXiv · cs.AI · 15d ago

SafeSteer: Localized On-Policy Distillation for Efficient Safety Alignment

SafeSteer proposes a safety alignment method that targets only 'safety tokens' in the output distribution rather than applying global fine-tuning, arguing that safety features are inherently sparse. It constructs a safety teacher via activation steering, then restricts a reverse KL penalty to the selected safety tokens during training. The approach reports strong safety performance across seven benchmarks with minimal capability degradation, while requiring only 100 harmful samples, less than 1% of the data used by prior baselines.
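
To make the idea of a reverse KL penalty restricted to selected tokens more concrete, here is a minimal sketch in plain Python. It is not the paper's implementation. It assumes that "safety tokens" can be represented as a boolean mask over sequence positions, and that the student and teacher each provide per-position logits over the same vocabulary. The names `log_softmax`, `reverse_kl`, `masked_reverse_kl`, and `safety_mask` are hypothetical, and how the mask is chosen is left open because the summary does not describe it.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one position's vocabulary logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def reverse_kl(student_logits, teacher_logits):
    """Reverse KL, i.e. KL(student || teacher), for a single position."""
    log_s = log_softmax(student_logits)
    log_t = log_softmax(teacher_logits)
    # Expectation under the student distribution of (log p_student - log p_teacher).
    return sum(math.exp(ls) * (ls - lt) for ls, lt in zip(log_s, log_t))

def masked_reverse_kl(student_seq, teacher_seq, safety_mask):
    """Average reverse KL over the positions flagged in safety_mask.

    Assumption: safety_mask is a hypothetical per-position boolean list;
    unflagged positions contribute nothing to the penalty.
    """
    selected = [
        reverse_kl(s, t)
        for s, t, keep in zip(student_seq, teacher_seq, safety_mask)
        if keep
    ]
    return sum(selected) / len(selected) if selected else 0.0

# Toy example: two positions, vocabulary of size two.
student = [[0.0, 0.0], [2.0, 0.0]]
teacher = [[0.0, 0.0], [0.0, 2.0]]
penalty = masked_reverse_kl(student, teacher, [False, True])
```

In this sketch the penalty only depends on the flagged positions, so a student that disagrees with the teacher elsewhere is not pushed toward it there. This is one way to read the summary's claim that localized training limits capability degradation.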