8B autoregressive language model
8b-autoregressive-language-model-a26da78b · 1 event · first seen 28d ago
Aliases: 8B autoregressive language model
Recent events (1)
Language-Switching Backdoor Triggers Use Orthogonal Latent Subspace in LLMs
Researchers identify and decompose the internal circuit underlying a language-switching backdoor attack in an 8B-parameter autoregressive language model, where a three-word Latin trigger redirects English output to French. The circuit operates in three phases: early attention heads compose trigger tokens, a mid-layer signal propagates through a subspace orthogonal to the model's natural language-identity direction, and a final MLP layer converts the latent signal into French logits. The entire circuit flows through a serial bottleneck at a single sequence position, meaning corrupting that position mitigates the trigger but also degrades general capabilities. Critically, the orthogonal encoding means defenses that search for language-like signals in intermediate representations would fail to detect this trigger.
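To make the final point concrete, here is a minimal toy sketch of why a probe along the language-identity direction misses a signal encoded orthogonally to it. This is not the researchers' method or model: the 64-dimensional vectors, the random directions, the signal strength of 5.0, and the names `lang_dir`, `trigger_dir`, and `language_probe` are all hypothetical stand-ins chosen for illustration.

```python
import numpy as np


def unit(v):
    """Return v scaled to unit length."""
    return v / np.linalg.norm(v)


def make_directions(dim, seed=0):
    """Build a toy language-identity direction and a trigger direction orthogonal to it."""
    rng = np.random.default_rng(seed)
    lang_dir = unit(rng.normal(size=dim))
    raw = rng.normal(size=dim)
    # Gram-Schmidt step: subtract the component of `raw` that lies along lang_dir.
    trigger_dir = unit(raw - (raw @ lang_dir) * lang_dir)
    return lang_dir, trigger_dir


def language_probe(hidden, lang_dir):
    """Hypothetical defense: read out how 'language-like' a hidden state is."""
    return float(hidden @ lang_dir)


dim = 64  # assumed toy hidden size, not the real model's width
lang_dir, trigger_dir = make_directions(dim)

# A small background activation shared by all three toy hidden states.
base = 0.1 * np.random.default_rng(1).normal(size=dim)

clean = base                            # no trigger present
triggered = base + 5.0 * trigger_dir    # backdoor signal, orthogonal to lang_dir
natural_french = base + 5.0 * lang_dir  # ordinary shift along the language direction

probe_clean = language_probe(clean, lang_dir)
probe_triggered = language_probe(triggered, lang_dir)
probe_french = language_probe(natural_french, lang_dir)

# The trigger signal is only visible when projecting onto its own direction.
trigger_readout = float((triggered - clean) @ trigger_dir)
```

In this toy setup the probe reading for `triggered` matches the reading for `clean` up to floating-point error, while `natural_french` shifts the reading by the full signal strength. The trigger only shows up in `trigger_readout`, which requires already knowing the trigger direction.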