Cross Activation Shift Distance

technique · active · provisional · cross-activation-shift-distance-50aad589 · 1 event · first seen 13d ago

Aliases: Cross Activation Shift Distance

Recent events (1)

6 · arXiv · cs.CL · 13d ago

Backdoor unlearning in LLMs generalizes across unknown triggers via cross-backdoor transfer

Researchers demonstrate that training an LLM to unlearn a single backdoor trigger can suppress other backdoors that were never explicitly targeted, a phenomenon they call cross-backdoor transfer. The study spans three model families with backdoors injected via pretraining or continual pretraining, and introduces a new metric, Cross Activation Shift Distance, to quantify the relationship between different unlearning interventions. The finding suggests a potential defensive strategy in which defenders deliberately inject and then remove controlled backdoors to suppress unknown attacker-planted backdoors.