5 · arXiv cs.CL (Computation and Language) · 4h ago

LIHA reveals first-token broadcaster heads as mechanistic source of language identity in transformers

Researchers introduce Language Identity Head Ablation (LIHA), a causal intervention that zeros individual attention heads and measures the resulting language-switching behavior across 2,700 prompt-language pairs in seven languages. Applied to GPT-2, LIHA identifies a small set of 'first-token broadcaster' heads that propagate language identity signals throughout generation; when heads are ablated, compensatory redistribution follows a hierarchical, feedforward pattern. A controlled comparison between Qwen2.5-1.5B-Base and Qwen2.5-1.5B-Instruct provides direct causal evidence that instruction tuning reorganizes language identity circuits toward early-layer localization. The findings offer mechanistic grounding for why multilingual models sometimes generate in the wrong language and why this failure is difficult to correct.
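
As a rough illustration of the kind of intervention described here (zeroing one attention head and comparing against the unmodified forward pass), the following is a minimal pure-Python sketch on a toy attention layer. It is not the authors' code: the toy dimensions, the random weights, and the names `attention_layer` and `ablate` are assumptions made for illustration, and the output-difference measure stands in for the language-switching metric, which the summary does not specify.

```python
import math
import random

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def head_output(x, Wq, Wk, Wv):
    """Causal self-attention for a single head; x is a list of token vectors."""
    q = [matvec(Wq, t) for t in x]
    k = [matvec(Wk, t) for t in x]
    v = [matvec(Wv, t) for t in x]
    scale = math.sqrt(len(q[0]))
    out = []
    for i in range(len(x)):
        # Position i attends only to positions 0..i (causal mask).
        scores = [sum(a * b for a, b in zip(q[i], k[j])) / scale for j in range(i + 1)]
        w = softmax(scores)
        out.append([sum(w[j] * v[j][d] for j in range(i + 1)) for d in range(len(v[0]))])
    return out

def attention_layer(x, heads, ablate=None):
    """Concatenate per-head outputs; the head at index `ablate` is zeroed.

    Real models apply an output projection after concatenation; zeroing a
    head's slice before that projection removes its contribution.
    """
    per_head = []
    for h, (Wq, Wk, Wv) in enumerate(heads):
        o = head_output(x, Wq, Wk, Wv)
        if h == ablate:
            o = [[0.0] * len(row) for row in o]
        per_head.append(o)
    return [sum((per_head[h][i] for h in range(len(heads))), []) for i in range(len(x))]

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

# Toy setup (all sizes are hypothetical).
rng = random.Random(0)
d_model, d_head, n_heads, n_tokens = 4, 2, 3, 5
heads = [tuple(rand_matrix(d_head, d_model, rng) for _ in range(3)) for _ in range(n_heads)]
x = [[rng.uniform(-1, 1) for _ in range(d_model)] for _ in range(n_tokens)]

clean = attention_layer(x, heads)
ablated = attention_layer(x, heads, ablate=1)

# Stand-in effect size: L2 distance between clean and ablated layer outputs.
effect = math.sqrt(sum((c - a) ** 2 for rc, ra in zip(clean, ablated) for c, a in zip(rc, ra)))
print(f"L2 change from ablating head 1: {effect:.4f}")
```

In a real study the comparison would presumably be made on generated text (for example, whether the output language changes) instead of on raw activations, and the loop would run over every head in turn.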

Evaluation and Benchmarking · Alignment and RLHF · First-Token Broadcasters: Mechanistic Origins of Language Identity and Distributed Robustness in Transformers · Language Identity Head Ablation · Qwen2.5-7B-Instruct-1M · Qwen2.5-1.5B-Base · GPT-2

Related guides (2)

Alignment and RLHF · Topic guide

Alignment and RLHF: Teaching AI Models to Behave

Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Related events (8)

8 · OpenAI Blog · 1mo ago

Aligning language models to follow instructions

OpenAI published a blog post describing their work on aligning language models to follow human instructions, corresponding to the InstructGPT research. This work introduced reinforcement learning from human feedback (RLHF) as a core technique for training models to be more helpful, honest, and aligned with user intent. The approach demonstrated that smaller instruction-tuned models could outperform larger base models on human preference evaluations, marking a foundational shift in how language models are trained and deployed.

Frontier Model Releases · Alignment and RLHF · GPT-3 · Reinforcement Learning from Human Feedback · OpenAI

6 · arXiv · cs.CL · 1mo ago

Language-Switching Backdoor Triggers Use Orthogonal Latent Subspace in LLMs

Researchers identify and decompose the internal circuit underlying a language-switching backdoor attack in an 8B-parameter autoregressive language model, where a three-word Latin trigger redirects English output to French. The circuit operates in three phases: early attention heads compose trigger tokens, a mid-layer signal propagates through a subspace orthogonal to the model's natural language-identity direction, and a final MLP layer converts the latent signal into French logits. The entire circuit flows through a serial bottleneck at a single sequence position, meaning corrupting that position mitigates the trigger but also degrades general capabilities. Critically, the orthogonal encoding means defenses that search for language-like signals in intermediate representations would fail to detect this trigger.

Evaluation and Benchmarking · AI Safety Research · mechanistic interpretability · Backdoor Circuit Analysis (Language-Switching) · 8B autoregressive language model

5 · OpenAI Blog · 1mo ago

Why Language Models Hallucinate

OpenAI published research explaining the mechanisms behind language model hallucination. The work connects improved evaluation methods to enhanced AI reliability, honesty, and safety. The post is sparse on technical detail, but its framing positions the work as foundational research relevant to alignment and deployment trust.

Evaluation and Benchmarking · AI Safety Research · hallucination (LLM) · OpenAI

6 · arXiv · cs.CL · 12d ago

The Shibboleth Effect: Cross-lingual behavioral skew in frontier LLMs under adversarial geopolitical simulation

Researchers introduce the 'Shibboleth Effect' — systematic behavioral differences in LLMs when operating in different languages — and audit six frontier models (GPT-4o, Llama-4, Mistral-Large, Gemini-3.1-Pro, Qwen3.6-Plus, DeepSeek-R1) using a synthetic maritime territorial dispute wargame played in English versus Turkish. Results are heterogeneous: Llama-4 becomes significantly more coercive in Turkish while Gemini-3.1-Pro and DeepSeek-R1 become less so, and GPT-4o shows no detectable shift. The study identifies two candidate buffering mechanisms — chain-of-thought institutional anchoring and multilingual RLHF alignment — with direct implications for deploying LLMs in diplomatic or crisis-management contexts.

Evaluation and Benchmarking · AI Safety Research · DeepSeek V4 · Mistral Large 2 · GPT-4o

6 · arXiv · cs.LG · 4d ago

Program synthesis used to reverse-engineer transformer attention heads with executable Python surrogates

Researchers propose a pipeline that approximates transformer attention heads with executable Python programs generated by a language model, then re-ranked by held-out predictive accuracy. Applied to GPT-2, TinyLlama-1.1B, and Llama-3B, fewer than 1,000 programs reproduce attention patterns with >75% average IoU similarity on TinyStories. Replacing 25% of attention heads with programmatic surrogates incurs only a 16% average perplexity increase while preserving downstream QA performance, demonstrating a path toward symbolic transparency in neural models.
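
The summary reports IoU similarity between real and surrogate attention patterns but does not say how the IoU is computed. One plausible reading, sketched below purely as an assumption, is to binarize each attention matrix at a threshold and compare the sets of (query, key) positions that survive. The function name `attention_iou`, the threshold of 0.1, and the example matrices are all hypothetical.

```python
def attention_iou(pattern_a, pattern_b, threshold=0.1):
    """IoU of the (query, key) positions whose weight is at least `threshold`.

    Assumed definition for illustration; the paper may define IoU differently.
    """
    def support(pattern):
        return {
            (i, j)
            for i, row in enumerate(pattern)
            for j, weight in enumerate(row)
            if weight >= threshold
        }

    set_a, set_b = support(pattern_a), support(pattern_b)
    union = set_a | set_b
    if not union:
        return 1.0  # Two empty patterns are treated as identical.
    return len(set_a & set_b) / len(union)

# Hypothetical causal attention patterns for a 3-token sequence.
model_head = [
    [1.0, 0.0, 0.0],
    [0.5, 0.5, 0.0],
    [0.2, 0.3, 0.5],
]
surrogate = [
    [1.0, 0.0, 0.0],
    [0.8, 0.2, 0.0],
    [0.05, 0.05, 0.9],
]

iou = attention_iou(model_head, surrogate)
print(f"IoU between head and surrogate: {iou:.3f}")
```

With this definition, a surrogate that attends to a subset of the positions the real head uses is penalized only for the positions it misses, not for differences in weight magnitude.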

Evaluation and Benchmarking · AI Safety Research · Llama 3.2 · GPT-2 · Explaining Attention with Program Synthesis

7 · arXiv · cs.CL · 6d ago

Language models linearly encode a 'value axis' tracking expected goal success, study finds

Researchers construct a 'value axis' in Qwen3-8B's activation space using synthetic in-context RL data, finding that this axis distinguishes high vs. low confidence, backtracking vs. non-backtracking rollouts, and correct vs. corrupted code. Steering along this axis causally modulates self-correction behavior and verbosity, while DPO training shifts the internal value of rewarded behaviors. Applied to real-world settings, the axis reveals that Qwen assigns low internal value to politically sensitive queries post-training and that SFT increases domain-specific confidence. The findings suggest LLMs linearly encode an estimate of expected goal success that shapes their generative behavior.
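
The summary does not describe how the axis is fitted. A common way to obtain a linear direction in activation space is the difference between class means, and steering then amounts to adding a scaled copy of that direction to an activation. The sketch below shows that generic recipe under those assumptions; it is not the paper's procedure, and the function names and the two-dimensional toy activations are hypothetical.

```python
import math

def mean_vector(vectors):
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]

def unit_axis(high, low):
    """Unit vector pointing from the mean of `low` to the mean of `high`."""
    diff = [h - l for h, l in zip(mean_vector(high), mean_vector(low))]
    norm = math.sqrt(sum(x * x for x in diff))
    return [x / norm for x in diff]

def project(activation, axis):
    """Scalar coordinate of an activation along the axis."""
    return sum(a * b for a, b in zip(activation, axis))

def steer(activation, axis, alpha):
    """Shift an activation by `alpha` units along the axis."""
    return [a + alpha * b for a, b in zip(activation, axis)]

# Hypothetical activations from high-value and low-value contexts.
high_value = [[1.0, 2.0], [3.0, 2.0]]
low_value = [[-1.0, 2.0], [-3.0, 2.0]]

axis = unit_axis(high_value, low_value)
activation = [0.5, 7.0]
steered = steer(activation, axis, alpha=2.0)

print("axis:", axis)
print("value before steering:", project(activation, axis))
print("value after steering:", project(steered, axis))
```

Because the axis has unit length, steering by `alpha` raises the projected value by exactly `alpha` and leaves the components orthogonal to the axis unchanged.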

AI Safety Research · Alignment and RLHF · The Value Axis: Language Models Encode Whether They're on the Right Track · Direct Preference Optimization (DPO) · Qwen3-4B

5 · arXiv · cs.LG · 1mo ago

Artificial Aphasias in Lesioned Language Models

Researchers apply an aphasia-inspired 'lesioning' technique to five 1B-scale language models by zeroing out model parameters and measuring resulting language impairments against a Text Aphasia Battery (TAB). Across 112,426 outputs, the full range of aphasia symptoms emerges but in distributions distinct from human aphasia profiles. The study finds systematic differences between attention components (query, key, value, output) and feed-forward components, as well as depth-dependent effects where early-layer lesions cause syntactic/semantic symptoms and late-middle layers yield phonological and fluency deficits. The qualitative divergence between LM and human aphasia patterns suggests aphasia syndromes are shaped by learning and processing details rather than being universal consequences of disrupted language processing.

Evaluation and Benchmarking · aphasia · 1B-scale language models · lesioning technique

4 · arXiv · cs.CL · 5d ago

Cross-lingual in-context learning source language selection challenges fine-tuning assumptions

A new arXiv paper conducts a broad empirical study of cross-lingual transfer in few-shot in-context learning (ICL), spanning seven tasks, six models, and a typologically diverse set of languages. The study finds that conventional heuristics from supervised fine-tuning — such as relying on linguistic similarity or data availability — do not consistently transfer to the ICL regime. The authors also analyze language confusion as a key obstacle in generative cross-lingual ICL and propose alternative heuristics for source language selection.

Evaluation and Benchmarking · When English Isn't the Best Teacher: Source Language Effects in Cross-Lingual In-Context Learning