Entity · paper

Backdoor Unlearning Generalization: A Path Toward the Removal of Unknown Triggers in LLMs

paper · active · backdoor-unlearning-generalization-a-path-toward-the-removal-of-unknown-triggers-in-llms-6de7778c · 1 event · first seen Jun 3, 2026

Co-occurring entities

Cross Activation Shift Distance

More like this (12)

Alternating Token-Weighted Unlearning
Can We Break LLMs Out of Self-Loops? Fine-Grained Reasoning Control with Activation Steering
Attention Amnesia in Hybrid LLMs: When CoT Fine-Tuning Breaks Long-Range Recall, and How to Fix It
Towards Root Memories: Benchmarking and Enhancing Implicit Logical Memory Retrieval for Personalized LLMs
Forecasting With LLMs: Improved Generalization Through Feature Steering
Continual LLM Upcycling: A Predictor-Gated Bank-Wise Sparsity Training Recipe for Dense-to-Sparse LLMs
ExpRL: Exploratory RL for LLM Mid-Training
Towards Mechanistically Understanding Why Memorized Knowledge Fails to Generalize in Large Language Model Finetuning
Reinforcement Learning with Metacognitive Feedback Elicits Faithful Uncertainty Expression in LLMs
Which Models Are Our Models Built On? Auditing Invisible Dependencies in Modern LLMs
When Good Verifiers Go Bad: Self-Improving VLMs Can Regress on New Tasks
Learning from the Self-future: On-policy Self-distillation for dLLMs

Recent events (1)

6 · arXiv · cs.CL · Jun 3, 2026

Backdoor unlearning in LLMs generalizes across unknown triggers via cross-backdoor transfer

Researchers demonstrate that training an LLM to unlearn a single backdoor trigger can suppress other backdoors that were never explicitly targeted, a phenomenon they call cross-backdoor transfer. The study spans three model families with backdoors injected via pretraining or continual pretraining, and introduces a new metric, Cross Activation Shift Distance, to quantify the relationship between different unlearning interventions. The finding points to a potential defensive strategy in which defenders deliberately inject and then remove controlled backdoors to suppress unknown attacker-planted backdoors.
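
One way to make the idea of cross-backdoor transfer concrete is to compare how often an untargeted trigger still succeeds before and after unlearning a different trigger. The sketch below is illustrative only and is not the paper's evaluation code or its Cross Activation Shift Distance metric, which this summary does not define. The names `attack_success_rate`, `transfer_ratio`, `generate`, and `is_attack_output` are hypothetical, and the relative-reduction formula is an assumption chosen for simplicity.

```python
from typing import Callable, Iterable


def attack_success_rate(
    generate: Callable[[str], str],
    prompts: Iterable[str],
    trigger: str,
    is_attack_output: Callable[[str], bool],
) -> float:
    """Fraction of triggered prompts whose output shows the backdoor behavior.

    `generate` stands in for any text-in, text-out model call (hypothetical).
    `is_attack_output` is a caller-supplied detector for the malicious behavior.
    """
    prompts = list(prompts)
    if not prompts:
        raise ValueError("need at least one prompt")
    # Prepend the trigger to every clean prompt and count backdoor activations.
    hits = sum(is_attack_output(generate(f"{trigger} {p}")) for p in prompts)
    return hits / len(prompts)


def transfer_ratio(asr_before: float, asr_after: float) -> float:
    """Relative drop in attack success rate for a trigger that was NOT targeted.

    1.0 means the untargeted backdoor was fully suppressed, 0.0 means no change.
    This ratio is an assumed illustrative measure, not a metric from the paper.
    """
    if asr_before == 0:
        return 0.0  # nothing to suppress, so report no transfer
    return (asr_before - asr_after) / asr_before
```

In use, one would compute `attack_success_rate` for a held-out trigger on the backdoored model and again on the model after unlearning a different trigger, then pass both values to `transfer_ratio`.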

AI Safety Research · Alignment and RLHF · Cross Activation Shift Distance · Backdoor Unlearning Generalization: A Path Toward the Removal of Unknown Triggers in LLMs