Entity · model

8B autoregressive language model

model · active · 8b-autoregressive-language-model-a26da78b · 1 event · first seen May 19, 2026

Aliases: 8B autoregressive language model

Co-occurring entities

mechanistic interpretability · Backdoor Circuit Analysis (Language-Switching) · backdoor attack · attention head circuit

More like this (12)

1B-scale language models · 7B language model · Random Language Model · large language model agents · Reinforcement Learning for Language Models · mRNA Language Model · 32B language model · unsupervised language modeling · generative language modeling · large language models · multi-turn language models · AnyLanguageModel

Recent events (1)

arXiv · cs.CL · May 19, 2026

Language-Switching Backdoor Triggers Use Orthogonal Latent Subspace in LLMs

Researchers identify and decompose the internal circuit underlying a language-switching backdoor attack in an 8B-parameter autoregressive language model, where a three-word Latin trigger redirects English output to French. The circuit operates in three phases: early attention heads compose trigger tokens, a mid-layer signal propagates through a subspace orthogonal to the model's natural language-identity direction, and a final MLP layer converts the latent signal into French logits. The entire circuit flows through a serial bottleneck at a single sequence position, meaning corrupting that position mitigates the trigger but also degrades general capabilities. Critically, the orthogonal encoding means defenses that search for language-like signals in intermediate representations would fail to detect this trigger.
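
The last point, that a detector aligned with the language-identity direction would miss a signal carried in an orthogonal subspace, can be illustrated with a toy NumPy sketch. Everything in it is an assumption made for illustration and not the paper's setup: the hidden size `d_model`, the random `lang_dir` and `trigger_dir` vectors, the `strength` value, and the synthetic Gaussian "activations" are all hypothetical stand-ins for real model internals.

```python
import numpy as np


def unit(v):
    """Scale a vector to unit length."""
    return v / np.linalg.norm(v)


def orthogonal_to(v, reference):
    """Remove the component of v along the unit vector `reference`, then renormalize."""
    return unit(v - np.dot(v, reference) * reference)


rng = np.random.default_rng(0)
d_model = 64      # hypothetical hidden size, not the real model's
strength = 5.0    # hypothetical magnitude of the latent trigger signal

# Hypothetical "natural language-identity" direction and a trigger direction
# constructed to be orthogonal to it.
lang_dir = unit(rng.normal(size=d_model))
trigger_dir = orthogonal_to(rng.normal(size=d_model), lang_dir)

# Synthetic activations: clean inputs, and the same inputs with the trigger
# signal written along the orthogonal direction.
clean = rng.normal(size=(200, d_model))
triggered = clean + strength * trigger_dir


def mean_shift(direction):
    """Difference in mean projection between triggered and clean activations."""
    return float(np.mean(triggered @ direction) - np.mean(clean @ direction))


lang_shift = mean_shift(lang_dir)        # what a language-direction probe sees
trigger_shift = mean_shift(trigger_dir)  # what a probe on the trigger direction sees

print(f"shift along language direction: {lang_shift:.2e}")
print(f"shift along trigger direction:  {trigger_shift:.2f}")
```

In this toy setting the projection onto `lang_dir` is unchanged by the trigger up to floating-point error, while the projection onto `trigger_dir` moves by `strength`. A defense would need to know or discover the trigger direction itself, which is the difficulty the summary describes.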

Evaluation and Benchmarking · AI Safety Research · mechanistic interpretability · Backdoor Circuit Analysis (Language-Switching) · 8B autoregressive language model