Entity · technique

backdoor attack

technique · active · backdoor-attack-c42b002f · 1 event · first seen May 19, 2026

Aliases: backdoor attack

Co-occurring entities

mechanistic interpretability · Backdoor Circuit Analysis (Language-Switching) · 8B autoregressive language model · attention head circuit

More like this (12)

Embedded Attack · Backdoor Circuit Analysis (Language-Switching) · data exfiltration · reward hacking · URL-based data exfiltration · skill-based attacks · distillation attacks · pickle exploit · black-box jailbreaking · Input-Aware Dynamic Backdoor Attack Against Quantum Neural Networks · black-box adversarial attacks · social engineering

Recent events (1)

6 · arXiv · cs.CL · May 19, 2026

Language-Switching Backdoor Triggers Use Orthogonal Latent Subspace in LLMs

Researchers identify and decompose the internal circuit underlying a language-switching backdoor attack in an 8B-parameter autoregressive language model, where a three-word Latin trigger redirects English output to French. The circuit operates in three phases: early attention heads compose trigger tokens, a mid-layer signal propagates through a subspace orthogonal to the model's natural language-identity direction, and a final MLP layer converts the latent signal into French logits. The entire circuit flows through a serial bottleneck at a single sequence position, meaning corrupting that position mitigates the trigger but also degrades general capabilities. Critically, the orthogonal encoding means defenses that search for language-like signals in intermediate representations would fail to detect this trigger.

Evaluation and Benchmarking · AI Safety Research · mechanistic interpretability · Backdoor Circuit Analysis (Language-Switching) · 8B autoregressive language model
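
The summary's point about orthogonal encoding can be illustrated with a toy calculation. This is a minimal sketch, not the paper's method: the hidden size, the "language-identity" direction, the backdoor signal, and the probe are all hypothetical stand-ins built from random vectors. It only shows why a detector that measures projection onto a language direction would read near zero for a signal living in the orthogonal subspace, even when that signal is large.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def norm(u):
    return math.sqrt(dot(u, u))


def remove_component(v, unit_dir):
    """Subtract from v its component along unit_dir (one Gram-Schmidt step)."""
    c = dot(v, unit_dir)
    return [a - c * b for a, b in zip(v, unit_dir)]


rng = random.Random(0)
dim = 64  # hypothetical hidden size, chosen only for illustration

# Hypothetical "language-identity" direction, normalized to unit length.
raw = [rng.gauss(0, 1) for _ in range(dim)]
raw_norm = norm(raw)
lang_dir = [x / raw_norm for x in raw]

# Hypothetical backdoor signal, constructed to be orthogonal to lang_dir.
backdoor_signal = remove_component(
    [rng.gauss(0, 1) for _ in range(dim)], lang_dir
)

# A natural language switch, modeled here as a shift along lang_dir.
natural_shift = [3.0 * x for x in lang_dir]


def language_probe(hidden):
    """Toy detector: magnitude of the projection onto the language direction."""
    return abs(dot(hidden, lang_dir))


probe_natural = language_probe(natural_shift)
probe_backdoor = language_probe(backdoor_signal)

print("probe on natural shift: ", probe_natural)
print("probe on backdoor signal:", probe_backdoor)
print("norm of backdoor signal: ", norm(backdoor_signal))
```

In this toy setup the probe responds to the natural shift but not to the orthogonal signal, although the latter has a substantial norm. A real defense would operate on actual model activations and would need to look beyond a single known direction.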