Almanac
model

8B autoregressive language model

modelactive8b-autoregressive-language-model-a26da78b·1 events·first seen 28d ago

Aliases: 8B autoregressive language model

Co-occurring entities

More like this (12)

Recent events (1)

6arXiv · cs.CL·28d ago·source ↗

Language-Switching Backdoor Triggers Use Orthogonal Latent Subspace in LLMs

Researchers identify and decompose the internal circuit underlying a language-switching backdoor attack in an 8B-parameter autoregressive language model, where a three-word Latin trigger redirects English output to French. The circuit operates in three phases: early attention heads compose trigger tokens, a mid-layer signal propagates through a subspace orthogonal to the model's natural language-identity direction, and a final MLP layer converts the latent signal into French logits. The entire circuit flows through a serial bottleneck at a single sequence position, meaning corrupting that position mitigates the trigger but also degrades general capabilities. Critically, the orthogonal encoding means defenses that search for language-like signals in intermediate representations would fail to detect this trigger.