paper
Backdoor Unlearning Generalization: A Path Toward the Removal of Unknown Triggers in LLMs
paper · active · provisional
backdoor-unlearning-generalization-a-path-toward-the-removal-of-unknown-triggers-in-llms-6de7778c · 1 event · first seen 13d ago
Aliases: Backdoor Unlearning Generalization: A Path Toward the Removal of Unknown Triggers in LLMs
More like this (12)
Alternating Token-Weighted Unlearning
Attention Amnesia in Hybrid LLMs: When CoT Fine-Tuning Breaks Long-Range Recall, and How to Fix It
Continual LLM Upcycling: A Predictor-Gated Bank-Wise Sparsity Training Recipe for Dense-to-Sparse LLMs
ExpRL: Exploratory RL for LLM Mid-Training
Which Models Are Our Models Built On? Auditing Invisible Dependencies in Modern LLMs
When Good Verifiers Go Bad: Self-Improving VLMs Can Regress on New Tasks
Learning from the Self-future: On-policy Self-distillation for dLLMs
inference-time behavioural unlearning
A sleep-like consolidation mechanism for LLMs
Scaling LLM Reasoning from Minimal Labels: A Semi-Supervised Framework with a Lightweight Verifier
Janus: A Benchmark for Goal-Conditioned Information Distortion in LLMs
MAML (Model-Agnostic Meta-Learning)
Recent events (1)
Backdoor unlearning in LLMs generalizes across unknown triggers via cross-backdoor transfer
Researchers show that training an LLM to unlearn a single backdoor trigger can also suppress other backdoors that were never explicitly targeted, a phenomenon they call cross-backdoor transfer. The study spans three model families, with backdoors injected during pretraining or continual pretraining, and introduces a metric, Cross Activation Shift Distance, to quantify the relationship between different unlearning interventions. The finding suggests a possible defensive strategy: defenders deliberately inject and then remove controlled backdoors in order to suppress unknown backdoors planted by an attacker.