6arXiv cs.CL (Computation and Language)·10d ago

The Shibboleth Effect: Cross-lingual behavioral skew in frontier LLMs under adversarial geopolitical simulation

Researchers introduce the 'Shibboleth Effect' — systematic behavioral differences in LLMs when operating in different languages — and audit six frontier models (GPT-4o, Llama-4, Mistral-Large, Gemini-3.1-Pro, Qwen3.6-Plus, DeepSeek-R1) using a synthetic maritime territorial dispute wargame played in English versus Turkish. Results are heterogeneous: Llama-4 becomes significantly more coercive in Turkish while Gemini-3.1-Pro and DeepSeek-R1 become less so, and GPT-4o shows no detectable shift. The study identifies two candidate buffering mechanisms — chain-of-thought institutional anchoring and multilingual RLHF alignment — with direct implications for deploying LLMs in diplomatic or crisis-management contexts.

Evaluation and Benchmarking AI Safety Research Alignment and RLHF DeepSeek V4 Mistral Large 2 GPT-4o The Shibboleth Effect: Auditing the Cross-Lingual Distributional Skew of Large Language Models Google DeepMind Llama OpenAI Gemini-3.1-Pro Meta Qwen3.6-Plus

Related guides (4)

OpenAI

OpenAI: The Lab That Made AI a Household Word

Read asBeginner In-depth

Google DeepMind

Google DeepMind: Frontier AI Across Models, Robotics, and Scientific Discovery

Read asIn-depth

DeepSeek V4

DeepSeek V4: The Open-Weights Giant Reshaping AI Economics

Read asBeginner In-depth

AI Safety ResearchTopic guide

AI Safety Research: From Lab Evals to Geopolitical Flashpoint

Read asIn-depth

Related events (8)

7arXiv · cs.AI·26d ago·source ↗

Geopolitical Bias in LLMs Originates in Post-Training, Not Pre-Training Data

A study testing seven open-weight LLM pairs (base vs. chat models) across seven labs finds that geopolitical bias is introduced during post-training rather than inherited from pre-training data. Six of seven labs showed post-training shifts favoring the developer's home country or region, with Alibaba's Qwen 2.5 showing the most extreme shift (18x increase in China-favourability log-odds). The effect is also language-dependent: Mistral becomes pro-France only under French prompting. The authors argue this implicates alignment and RLHF processes as active shapers of geopolitical perspective, calling for greater transparency and auditing of post-training pipelines.

Evaluation and Benchmarking Open Weights Progress Mistral AI Alibaba Mistral +6 more

6arXiv · cs.CL·17d ago·source ↗

Adversarial robustness and safety alignment in multilingual multimodal LLMs: cross-lingual vulnerability and 'safety-by-failure'

A systematic study evaluates adversarial robustness and safety alignment of multimodal LLMs across 12 languages, finding that adversarial images optimized in one language transfer to others (cross-lingual transferability). The paper introduces the concept of 'safety-by-failure': low-resource languages appear safer not due to genuine alignment but because models fail to comprehend harmful instructions in those languages. Models like Qwen3-VL that integrate multilingual capability throughout training (rather than only at instruction tuning) show genuine cross-lingual safety with active refusal. The findings challenge the assumption that low-resource language safety metrics reflect real alignment.

Evaluation and Benchmarking AI Safety Research Qwen3-4B Exploring Adversarial Robustness and Safety Alignment in Multilingual Multi-Modal Large Language Models +1 more

5arXiv · cs.CL·15d ago·source ↗

LLMs fail to consistently simulate demographic perspective-taking in hate speech annotation

A new arXiv paper evaluates whether persona-conditioned LLMs can replicate how different demographic groups perceive hate speech, testing three dimensions: inter-group disagreement, in-group sensitivity, and vicarious prediction. No model consistently captures all three dimensions, and performance is highly model-dependent rather than emerging reliably from identity prompts alone. Vicarious prompting with Llama 3.1 provides the closest approximation to human disagreement patterns across demographic axes. The findings have implications for using LLMs as proxies for diverse human annotators in content moderation tasks.

Evaluation and Benchmarking AI Safety Research From Self to Other: Evaluating Demographic Perspective-Taking in LLM Hate Speech Annotation Meta Llama-3.1-8B

5arXiv · cs.CL·10d ago·source ↗

Audit finds cultural translation failures and diversity collapse in LLM-adapted math word problems across 7 languages

Researchers audited how Claude Opus 4, GPT-4.1, and Gemini 2.5 Pro adapt 60 English math word problems into seven languages spanning South Asia and Italy, annotating 6,489 entity transformations. Models agreed on transformation type only 62.5% of the time and on specific substitutions in just 33.5% of cases, meaning model choice substantially shapes the cultural world students encounter. All 21 language-model combinations exhibited 'entropy collapse'—adaptations compressed rather than expanded cultural diversity—and models produced systematic regional misattributions (e.g., Bangladeshi currency for Indian Bengali students) and cross-cultural contamination (e.g., egg hunts framed as Eid activities). The study highlights that surface plausibility masks deeper corpus-level failures invisible in individual translations.

Evaluation and Benchmarking AI Safety Research Claude Opus 4.6 Who Brought Easter Eggs to Eid? Auditing Cultural Translation of Math Word Problems Across Diverse Languages and Regions Google +4 more

4Hugging Face Blog·1mo ago·source ↗

Letting Large Models Debate: The First Multilingual LLM Debate Competition

Hugging Face introduces a multilingual LLM debate competition where large language models compete against each other in structured debates. The initiative explores multi-agent interaction, argumentation quality, and cross-lingual reasoning capabilities. This represents an evaluation framework for assessing LLM persuasion, coherence, and multilingual performance in adversarial settings.

Evaluation and Benchmarking Agent and Tool Ecosystem Hugging Face LLM Debate Competition

4arXiv · cs.CL·1mo ago·source ↗

Quantifying Cross-Linguistic Effects of Syncretism on Agreement Attraction Using LLM Processing Proxies

This paper investigates why morphological syncretism amplifies agreement attraction errors in some languages (English, German, Russian) but not others (Turkish, Armenian), a pattern lacking a principled account. The authors use surprisal and attention entropy derived from large language models as proxies for human sentence processing across four languages. LLM-derived measures successfully replicate behavioral findings in English and German, align with Turkish null results, and partially capture Russian patterns. The work demonstrates LLMs as tools for cross-linguistic psycholinguistic investigation.

Evaluation and Benchmarking agreement attraction large language models surprisal +2 more

6arXiv · cs.CL·12d ago·source ↗

Study finds local languages provide better cultural knowledge access in LLMs once proficiency is controlled

A new arXiv paper introduces a controlled evaluation framework to disentangle language proficiency from culture-specific knowledge access in LLMs. Using real-world cultural questions across 13 locales and ~80 models, the authors apply item response theory to show that while English dominates on culture-agnostic questions, local languages yield a consistent knowledge-access advantage on culture-specific questions once proficiency differences are factored out. The finding challenges the common interpretation that weaker local-language accuracy implies weaker cultural knowledge, and has implications for how multilingual and regionally-aligned models are evaluated.

Evaluation and Benchmarking The Masked Advantage: Uncovering Local-Language Access to Cultural Knowledge in LLMs item response theory

6arXiv · cs.CL·1mo ago·source ↗

Tracing the Emergence of Human-Like Pragmatic Reasoning in LLMs Across Languages

Researchers conducted a population-matching experiment evaluating 25 LLMs on conditional inference tasks across four languages, comparing model behavior to matched human populations. The study finds that LLMs function as accurate semantic operators but systematically fail to capture pragmatic enrichments—context-sensitive inferences beyond literal logical meaning—that humans apply effortlessly. Model performance on pragmatic reasoning is not predicted by open vs. closed weights, training orientation, or architecture type, suggesting pragmatic reasoning remains an emergent and unreliable capability. The findings contribute to ongoing debates about whether LLMs reason like humans or merely approximate surface-level linguistic patterns.

Frontier Model Releases Evaluation and Benchmarking large language models Population-Matching Experiment Pragmatic Reasoning +1 more