5 · arXiv cs.CL (Computation and Language) · 6d ago

Study finds AI-generated stories rely on superficial cultural markers rather than holistic localization

Researchers propose a method to measure the degree of 'templated' versus 'holistic' cultural localization in AI-generated stories, finding that only 9-17% of vocabulary accounts for cross-national variation and that a shared culturally-agnostic narrative template underlies most outputs. The study evaluates five models across 125 topics and 193 nationalities. A notable finding is that cultural markers associated with 19 countries—mostly in the Global South—are rated as offensive on average, raising concerns about bias and representation in multilingual/multicultural AI content generation.

Evaluation and Benchmarking · AI Safety Research · Characterizing Cultural Localization in AI-Generated Stories

Related guides (2)

AI Safety Research · Topic guide

AI Safety Research: From Lab Policies to Real-World Flashpoints

Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Related events (8)

5 · arXiv · cs.CL · 11d ago

Audit finds cultural translation failures and diversity collapse in LLM-adapted math word problems across 7 languages

Researchers audited how Claude Opus 4, GPT-4.1, and Gemini 2.5 Pro adapt 60 English math word problems into seven languages spanning South Asia and Italy, annotating 6,489 entity transformations. Models agreed on transformation type only 62.5% of the time and on specific substitutions in just 33.5% of cases, meaning model choice substantially shapes the cultural world students encounter. All 21 language-model combinations exhibited 'entropy collapse'—adaptations compressed rather than expanded cultural diversity—and models produced systematic regional misattributions (e.g., Bangladeshi currency for Indian Bengali students) and cross-cultural contamination (e.g., egg hunts framed as Eid activities). The study highlights that surface plausibility masks deeper corpus-level failures invisible in individual translations.

Evaluation and Benchmarking · AI Safety Research · Claude Opus 4.6 · Who Brought Easter Eggs to Eid? Auditing Cultural Translation of Math Word Problems Across Diverse Languages and Regions · Google
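The "entropy collapse" reported in the math word problem audit can be made concrete with a small calculation. The sketch below assumes the collapse is measured as a drop in Shannon entropy over the distribution of entities before and after adaptation; the summary does not state the paper's exact metric, and the entity lists are invented for illustration only.

```python
import math
from collections import Counter


def shannon_entropy(items):
    """Shannon entropy (in bits) of the empirical distribution of items."""
    counts = Counter(items)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical entity mentions, NOT data from the study.
source_entities = ["pizza", "baseball", "dollars", "apples"]
adapted_entities = ["rupees", "rupees", "cricket", "rupees"]

source_entropy = shannon_entropy(source_entities)
adapted_entropy = shannon_entropy(adapted_entities)

# A lower entropy after adaptation means the adapted corpus
# concentrates on fewer distinct entities than the source did.
collapsed = adapted_entropy < source_entropy
```

Each adapted problem here could look plausible on its own; the loss of diversity only shows up when the distribution is measured across the whole set, which is the corpus-level point the summary makes.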

4 · arXiv · cs.CL · 12d ago

Large-scale social media analysis reveals stakeholder conflicts over machine translation priorities

Researchers analyze 79,286 social media posts from Reddit, Facebook, Bluesky, and Mastodon (2019–2025) to compare how four communities—AI developers, professional translators, language learners, and language service providers—discuss machine translation. The study finds significant disagreements and polarized sentiments across groups, with AI researchers framing MT as a technical benchmark problem while non-AI users prioritize quality nuances, trust, reliability, and social concerns. The work argues for redirecting MT research toward community-identified needs rather than benchmark performance alone.

Evaluation and Benchmarking · Reddit · Beyond Accuracy: Community Perspectives on Machine Translation

6 · arXiv · cs.CL · 13d ago

Study finds local languages provide better cultural knowledge access in LLMs once proficiency is controlled

A new arXiv paper introduces a controlled evaluation framework to disentangle language proficiency from culture-specific knowledge access in LLMs. Using real-world cultural questions across 13 locales and ~80 models, the authors apply item response theory to show that while English dominates on culture-agnostic questions, local languages yield a consistent knowledge-access advantage on culture-specific questions once proficiency differences are factored out. The finding challenges the common interpretation that weaker local-language accuracy implies weaker cultural knowledge, and has implications for how multilingual and regionally-aligned models are evaluated.

Evaluation and Benchmarking · The Masked Advantage: Uncovering Local-Language Access to Cultural Knowledge in LLMs · item response theory
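To see how item response theory can separate proficiency from knowledge access, consider the simplest IRT model, the one-parameter (Rasch) model, in which the probability of a correct answer depends on the gap between a respondent's ability and an item's difficulty. The sketch below is an illustration under that assumption only: the summary does not say which IRT variant the authors fit, and every number is hypothetical.

```python
import math


def rasch_probability(ability, difficulty):
    """Rasch (1PL) item response function: P(correct | ability, difficulty)."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def expected_accuracy(ability, difficulties):
    """Mean probability of a correct answer over a set of items."""
    probs = [rasch_probability(ability, b) for b in difficulties]
    return sum(probs) / len(probs)


# Hypothetical values: the model is more proficient in English (higher
# ability), but the culture-specific items are easier to access in the
# local language (lower difficulty).
english_ability, local_ability = 1.0, 0.0
english_difficulties = [0.5, 0.5]
local_difficulties = [0.0, 0.0]

english_accuracy = expected_accuracy(english_ability, english_difficulties)
local_accuracy = expected_accuracy(local_ability, local_difficulties)
```

In this toy setup raw accuracy is lower in the local language even though the items themselves are easier there, which is the kind of effect that only becomes visible once ability and difficulty are estimated separately.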

7 · The Batch · 9d ago

Study finds state media in training data causes LLMs to reflect government propaganda in native languages

Researchers from the University of Oregon, Purdue, UCSD, NYU, and Princeton found that state-controlled media is heavily overrepresented in web-scraped training datasets, causing Claude 3 Sonnet and GPT-4o to express significantly more favorable attitudes toward authoritarian governments when prompted in those governments' native languages. Chinese state media accounts for over 40x more documents in CulturaX than Chinese Wikipedia, and both models reproduced state-media strings at 3-5% rates. In comparisons against English prompts on the same topics, Chinese-language prompts drew the response more favorable to China's government roughly 68-75% of the time for both models, with the effect scaling with a country's World Press Freedom Index ranking.

Frontier Model Releases · Evaluation and Benchmarking · New York University · University of California San Diego · CulturaX

6 · arXiv · cs.CL · 4d ago

Location metadata causes systematic geographic bias leakage in LLMs, even with 'Unknown' placeholders

Researchers evaluate 'location leakage' — the phenomenon where LLMs generate geographically biased outputs when exposed to location metadata in user profiles, even when prompts are geographically neutral. Across creative writing and Q&A tasks, leakage spikes up to 793x above baseline for models including Llama 3.1-8B, Qwen3-8B, and Claude Sonnet 4.6. A novel structural finding shows that replacing location with 'Unknown' still elevates leakage by up to 72x, indicating the user profile frame itself acts as a conditioning signal independent of geographic content. This has direct implications for AI systems that use user metadata for localization.

Evaluation and Benchmarking · AI Safety Research · Claude Sonnet 4 · Alibaba · Qwen3-4B

4 · arXiv · cs.AI · 11d ago

Study finds AI disclosure designs in newsrooms fail readers, proposes user-agency-centered alternatives

An arXiv paper examines how newsrooms disclose AI involvement in news content, finding that neither brief labels nor detailed disclosures achieve the goal of building reader trust. A controlled experiment with 34 readers shows that detailed disclosures trigger a 'transparency dilemma' that can reduce trust, while one-line labels create an information gap that takes cognitive effort to fill. Readers instead preferred disclosure designs centered on user agency, including detail-on-demand interactions, proportional AI-ratio visualizations, and explicit 'no AI' labels. The author frames this as a design problem for the HCI community rather than a journalism ethics problem alone.

AI Safety Research · Enterprise Deployment Patterns · Designed by Journalists, but Is It for Readers? Rethinking AI Disclosures and Transparency in News

4 · arXiv · cs.CL · 25d ago

C4STYLI Benchmark: Probing Cultural Aesthetic Stylistics Awareness in LLMs

Researchers introduce C4STYLI, a benchmark of stylized translated movie titles and advertising slogans from Hong Kong and mainland China, designed to evaluate LLMs on cross-cultural aesthetic stylistics. Evaluations reveal that LLMs diverge from human stylistic recognition, with recognition ability varying by text domain and not consistently predicting generation performance. Structural ablation using logistic regression probes shows that LLMs in the Hong Kong setting rely on surface-level linguistic cues rather than deeper stylistic structure, indicating limited cultural sensitivity.

Evaluation and Benchmarking · C4STYLI · large language models · logistic regression probes

6 · arXiv · cs.CL · 1mo ago

Systematic 14-Day Evaluation of Six AI Chatbots as News Intermediaries Across Languages and Regions

Researchers evaluated six commercial AI chatbots (Gemini 3 Flash/Pro, Grok 4, Claude 4.5 Sonnet, GPT-5, GPT-4o mini) on 2,100 factual questions derived from same-day BBC News reporting across six regional services over 14 days in February 2026. Top systems exceed 90% multiple-choice accuracy on breaking news but lose 11-17% under free-response conditions. Key findings include systematic Hindi-language underperformance (79% vs. 89-91% elsewhere) driven by Anglophone retrieval bias, retrieval failures accounting for over 70% of errors, and dramatic accuracy collapse (to 19-70%) on questions containing subtle false premises. A detection-accuracy paradox is identified: the best false-premise detector does not yield the best adversarial accuracy, suggesting premise detection and answer recovery are partially independent capabilities.

Frontier Model Releases · Evaluation and Benchmarking · Gemini 3.5 Pro · BBC News · GPT-4o mini