6arXiv cs.CL (Computation and Language)·1mo ago

Language-Switching Backdoor Triggers Use Orthogonal Latent Subspace in LLMs

Researchers identify and decompose the internal circuit underlying a language-switching backdoor attack in an 8B-parameter autoregressive language model, where a three-word Latin trigger redirects English output to French. The circuit operates in three phases: early attention heads compose trigger tokens, a mid-layer signal propagates through a subspace orthogonal to the model's natural language-identity direction, and a final MLP layer converts the latent signal into French logits. The entire circuit flows through a serial bottleneck at a single sequence position, meaning corrupting that position mitigates the trigger but also degrades general capabilities. Critically, the orthogonal encoding means defenses that search for language-like signals in intermediate representations would fail to detect this trigger.

Evaluation and Benchmarking AI Safety Research mechanistic interpretability Backdoor Circuit Analysis (Language-Switching)8B autoregressive language model backdoor attack attention head circuit

Related guides (3)

mechanistic interpretabilityConcept

Mechanistic Interpretability: Looking Inside the AI Black Box

Read asBeginner In-depth

AI Safety ResearchTopic guide

AI Safety Research: From Lab Policies to Real-World Flashpoints

Read asBeginner In-depth

Evaluation and BenchmarkingTopic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Read asBeginner In-depth

Related events (8)

6arXiv · cs.CL·17d ago·source ↗

Backdoor unlearning in LLMs generalizes across unknown triggers via cross-backdoor transfer

Researchers demonstrate that training an LLM to unlearn a single backdoor trigger can suppress other backdoors that were never explicitly targeted, a phenomenon they call cross-backdoor transfer. The study spans three model families with backdoors injected via pretraining or continual pretraining, and introduces a new metric called Cross Activation Shift Distance to quantify the relationship between different unlearning interventions. The finding opens a potential defensive strategy where defenders deliberately inject and then remove controlled backdoors to suppress unknown attacker-planted backdoors.

AI Safety Research Alignment and RLHF Cross Activation Shift Distance Backdoor Unlearning Generalization: A Path Toward the Removal of Unknown Triggers in LLMs

6arXiv · cs.CL·17d ago·source ↗

Adversarial robustness and safety alignment in multilingual multimodal LLMs: cross-lingual vulnerability and 'safety-by-failure'

A systematic study evaluates adversarial robustness and safety alignment of multimodal LLMs across 12 languages, finding that adversarial images optimized in one language transfer to others (cross-lingual transferability). The paper introduces the concept of 'safety-by-failure': low-resource languages appear safer not due to genuine alignment but because models fail to comprehend harmful instructions in those languages. Models like Qwen3-VL that integrate multilingual capability throughout training (rather than only at instruction tuning) show genuine cross-lingual safety with active refusal. The findings challenge the assumption that low-resource language safety metrics reflect real alignment.

Evaluation and Benchmarking AI Safety Research Qwen3-4B Exploring Adversarial Robustness and Safety Alignment in Multilingual Multi-Modal Large Language Models +1 more

6arXiv · cs.CL·10d ago·source ↗

The Shibboleth Effect: Cross-lingual behavioral skew in frontier LLMs under adversarial geopolitical simulation

Researchers introduce the 'Shibboleth Effect' — systematic behavioral differences in LLMs when operating in different languages — and audit six frontier models (GPT-4o, Llama-4, Mistral-Large, Gemini-3.1-Pro, Qwen3.6-Plus, DeepSeek-R1) using a synthetic maritime territorial dispute wargame played in English versus Turkish. Results are heterogeneous: Llama-4 becomes significantly more coercive in Turkish while Gemini-3.1-Pro and DeepSeek-R1 become less so, and GPT-4o shows no detectable shift. The study identifies two candidate buffering mechanisms — chain-of-thought institutional anchoring and multilingual RLHF alignment — with direct implications for deploying LLMs in diplomatic or crisis-management contexts.

Evaluation and Benchmarking AI Safety Research DeepSeek V4 Mistral Large 2 GPT-4o +8 more

6arXiv · cs.CL·19d ago·source ↗

Trajectory Analysis of Masked Diffusion LMs for Graph-to-Text Generation with Lambda-Scaled Structural Decoding

This paper presents the first systematic study of masked diffusion language models (MDLMs) for graph-to-text generation, analyzing the order in which tokens are unmasked during iterative decoding. The authors find MDLMs naturally unmask entities first, then relational/function words, then structural tokens—a pattern disrupted by supervised fine-tuning, which prematurely anchors structural tokens and causes hallucination or omission. They propose lambda-scaled structural decoding, a training-free inference-time fix that recovers +9.4 BLEU-4, and introduce Graph-LLaDA, which integrates a Graph Transformer encoder into LLaDA's decoding process. Cross-dataset evaluation on the LAGRANGE benchmark shows prior baselines overfit to dataset-specific patterns while MDLM-based approaches generalize better.

Frontier Model Releases Evaluation and Benchmarking BLEU-4 Graph Transformer Diffusion Language Models +5 more

4arXiv · cs.CL·1mo ago·source ↗

Quantifying Cross-Linguistic Effects of Syncretism on Agreement Attraction Using LLM Processing Proxies

This paper investigates why morphological syncretism amplifies agreement attraction errors in some languages (English, German, Russian) but not others (Turkish, Armenian), a pattern lacking a principled account. The authors use surprisal and attention entropy derived from large language models as proxies for human sentence processing across four languages. LLM-derived measures successfully replicate behavioral findings in English and German, align with Turkish null results, and partially capture Russian patterns. The work demonstrates LLMs as tools for cross-linguistic psycholinguistic investigation.

Evaluation and Benchmarking agreement attraction large language models surprisal +2 more

5arXiv · cs.CL·11d ago·source ↗

Predictor-gated bank-wise sparsity recipe for dense-to-sparse LLM upcycling from Qwen2.5-8B

A new arXiv preprint introduces a continual training recipe to convert dense LLMs into channel-sparse models without post-hoc pruning. Starting from a Qwen2.5-8B checkpoint, the method uses a low-rank predictor to gate FFN channel routing, achieving 4x sparsity in FFN intermediate activations via a bank-wise top-k rule at 32K context. The routing module is trained on the main language modeling path, making the resulting sparsity hardware-oriented rather than approximate. The authors also identify and patch a layer-local long-context failure mode on the RULER-CWE benchmark.

Training Infrastructure Inference Economics Continual LLM Upcycling: A Predictor-Gated Bank-Wise Sparsity Training Recipe for Dense-to-Sparse LLMs SwiGLU RULER-CWE +1 more

6arXiv · cs.CL·1mo ago·source ↗

LASH: Adaptive Semantic Hybridization for Black-Box Jailbreaking of Large Language Models

LASH is a black-box jailbreak framework that adaptively composes outputs from multiple existing attack families into hybrid prompts using a genetic optimizer with a two-stage fitness function. Evaluated on JailbreakBench across six target models, LASH achieves 84.5% attack success rate (keyword-based) and 74.5% (LLM-judge) with only 30 mean target queries, outperforming five state-of-the-art baselines. The work demonstrates that no single jailbreak family dominates across models and harm categories, and that adaptive cross-strategy composition is a promising red-teaming direction. Results hold under three defense mechanisms.

Evaluation and Benchmarking AI Safety Research LASH black-box jailbreaking JailbreakBench +3 more

5arXiv · cs.CL·9d ago·source ↗

Systematic study reveals effectiveness-fluency trade-offs in LLM conditioning methods

A new arXiv paper systematically evaluates a range of LLM conditioning methods across both concept injection and removal scenarios, finding that efficient steering methods often degrade fluency significantly. A key finding is that activation steering is substantially less effective on instruction-tuned models than on base models, a previously overlooked interaction. Simple prompting and supervised fine-tuning work for concept injection but not removal, and cheap textual metrics are found to correlate well with expensive LLM-as-judge evaluations.

Evaluation and Benchmarking Alignment and RLHF On The Effectiveness-Fluency Trade-Off In LLM Conditioning: A Systematic Study