Audit finds cultural translation failures and diversity collapse in LLM-adapted math word problems across 7 languages
Researchers audited how Claude Opus 4, GPT-4.1, and Gemini 2.5 Pro adapt 60 English math word problems into seven languages spanning South Asia and Italy, annotating 6,489 entity transformations. The models agreed on transformation type only 62.5% of the time and on the specific substitution in just 33.5% of cases, so the choice of model substantially shapes the cultural world students encounter. All 21 language-model combinations (seven languages by three models) exhibited 'entropy collapse': adaptations compressed rather than expanded cultural diversity. The models also produced systematic regional misattributions (e.g., Bangladeshi currency for Indian Bengali students) and cross-cultural contamination (e.g., egg hunts framed as Eid activities). The study highlights that surface plausibility masks deeper corpus-level failures that are invisible in individual translations.
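One way to read 'entropy collapse' is as a drop in Shannon entropy over the entities that appear across a corpus of problems. The summary does not state the study's exact metric, so the sketch below is only an assumption-laden illustration: the function name `shannon_entropy` and both toy entity lists are hypothetical and are not data from the audit.

```python
import math
from collections import Counter


def shannon_entropy(items):
    """Shannon entropy, in bits, of the empirical distribution of `items`."""
    counts = Counter(items)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical toy corpora (NOT taken from the study).
# Four distinct entities in the source problems...
source_entities = ["pizza", "bagel", "taco", "burger"]
# ...mapped by an adaptation onto only two distinct entities.
adapted_entities = ["roti", "roti", "roti", "samosa"]

source_h = shannon_entropy(source_entities)
adapted_h = shannon_entropy(adapted_entities)

# Under this assumed definition, "collapse" means the adapted corpus
# is less diverse than the source corpus.
collapsed = adapted_h < source_h
print(f"source: {source_h:.3f} bits, adapted: {adapted_h:.3f} bits, collapsed: {collapsed}")
```

In this toy case every individual substitution looks plausible on its own; the loss of diversity only shows up when the entities are counted across the whole corpus, which is the kind of corpus-level failure the summary describes.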
Related events (8)
The Shibboleth Effect: Cross-lingual behavioral skew in frontier LLMs under adversarial geopolitical simulation
Researchers introduce the 'Shibboleth Effect' — systematic behavioral differences in LLMs when operating in different languages — and audit six frontier models (GPT-4o, Llama-4, Mistral-Large, Gemini-3.1-Pro, Qwen3.6-Plus, DeepSeek-R1) using a synthetic maritime territorial dispute wargame played in English versus Turkish. Results are heterogeneous: Llama-4 becomes significantly more coercive in Turkish while Gemini-3.1-Pro and DeepSeek-R1 become less so, and GPT-4o shows no detectable shift. The study identifies two candidate buffering mechanisms — chain-of-thought institutional anchoring and multilingual RLHF alignment — with direct implications for deploying LLMs in diplomatic or crisis-management contexts.
Adversarial robustness and safety alignment in multilingual multimodal LLMs: cross-lingual vulnerability and 'safety-by-failure'
A systematic study evaluates adversarial robustness and safety alignment of multimodal LLMs across 12 languages, finding that adversarial images optimized in one language transfer to others (cross-lingual transferability). The paper introduces the concept of 'safety-by-failure': low-resource languages appear safer not due to genuine alignment but because models fail to comprehend harmful instructions in those languages. Models like Qwen3-VL that integrate multilingual capability throughout training (rather than only at instruction tuning) show genuine cross-lingual safety with active refusal. The findings challenge the assumption that low-resource language safety metrics reflect real alignment.
LLM-Based Grammar Adaptation for Metamodel-Grammar Co-Evolution in Model-Driven Engineering
This paper proposes using LLMs to automate grammar adaptation when metamodels evolve in model-driven engineering, replacing tedious manual work and outperforming rule-based methods. Evaluated on six real-world Xtext DSLs using Claude Sonnet 4.5, ChatGPT 5.1, and Gemini 3, all three LLMs achieved 100% adaptation consistency on test DSLs versus 62-84% for rule-based approaches. A longitudinal study on QVTo showed LLMs successfully reused learned adaptations across all evolution steps without manual editing. However, on large-scale grammars (EAST-ADL, 297 rules), LLM adaptation consistency dropped well below 90%, revealing a scalability limitation.
Study finds local languages provide better cultural knowledge access in LLMs once proficiency is controlled
A new arXiv paper introduces a controlled evaluation framework to disentangle language proficiency from culture-specific knowledge access in LLMs. Using real-world cultural questions across 13 locales and ~80 models, the authors apply item response theory to show that while English dominates on culture-agnostic questions, local languages yield a consistent knowledge-access advantage on culture-specific questions once proficiency differences are factored out. The finding challenges the common interpretation that weaker local-language accuracy implies weaker cultural knowledge, and has implications for how multilingual and regionally-aligned models are evaluated.
Towards Reliable Multilingual LLMs-as-a-Judge: An Empirical Study
This paper systematically investigates strategies for extending LLM-based automatic evaluation (LLMs-as-a-Judge) to multilingual settings, covering high-, mid-, and low-resource languages (English, Spanish, Basque). The authors compare instruction translation, monolingual vs. multilingual supervision, and model size, finding that fine-tuned smaller models can match proprietary models when in-domain data is available, while zero-shot larger models are preferable out-of-domain. Two meta-evaluation datasets are extended to Spanish and Basque, and all data and code are publicly released.
Study finds AI-generated stories rely on superficial cultural markers rather than holistic localization
Researchers propose a method to measure the degree of 'templated' versus 'holistic' cultural localization in AI-generated stories, finding that only 9-17% of vocabulary accounts for cross-national variation and that a shared culturally-agnostic narrative template underlies most outputs. The study evaluates five models across 125 topics and 193 nationalities. A notable finding is that cultural markers associated with 19 countries—mostly in the Global South—are rated as offensive on average, raising concerns about bias and representation in multilingual/multicultural AI content generation.
Audit of Lombard language corpora reveals pervasive data quality and representational bias problems
Researchers conducted a manual audit of parallel and monolingual corpora available for Lombard, a low-resource language continuum from northern Italy. The study finds that web-scraped datasets suffer from severe language misidentification, boilerplate text, and non-linguistic noise, making apparent data abundance illusory. Additionally, high-quality data is heavily skewed toward Western Lombard varieties, leaving Eastern varieties underrepresented. The authors argue for variety-aware, community-driven curation over quantity-driven scraping.
Quantifying Cross-Linguistic Effects of Syncretism on Agreement Attraction Using LLM Processing Proxies
This paper investigates why morphological syncretism amplifies agreement attraction errors in some languages (English, German, Russian) but not others (Turkish, Armenian), a pattern lacking a principled account. The authors use surprisal and attention entropy derived from large language models as proxies for human sentence processing across four languages. LLM-derived measures successfully replicate behavioral findings in English and German, align with Turkish null results, and partially capture Russian patterns. The work demonstrates LLMs as tools for cross-linguistic psycholinguistic investigation.



