7 · The Batch (DeepLearning.AI) · 8d ago

Study finds state media in training data causes LLMs to reflect government propaganda in native languages

Researchers from the University of Oregon, Purdue, UCSD, NYU, and Princeton found that state-controlled media is heavily overrepresented in web-scraped training datasets, causing Claude 3 Sonnet and GPT-4o to express significantly more favorable attitudes toward authoritarian governments when prompted in those governments' native languages. Chinese state media accounts for more than 40 times as many documents in CulturaX as Chinese Wikipedia, and both models reproduced state-media strings at rates of 3-5%. When prompted in Chinese, both models gave answers more favorable to China's government than their answers to English prompts on the same topics roughly 68-75% of the time, and the effect scaled with a country's World Press Freedom Index ranking.

Frontier Model Releases · Evaluation and Benchmarking · AI Safety Research · Alignment and RLHF · New York University · University of California San Diego · CulturaX · Claude Opus 4.6 · Eddie Yang · GPT-4o · Hannah Waight · Princeton University · OpenAI · Common Crawl · Purdue University · Claude 3 Sonnet · University of Oregon · World Press Freedom Index · Anthropic

Related guides (3)

OpenAI

OpenAI: The Lab That Made AI a Household Word


Claude Opus 4.6

Claude Opus 4.6: Anthropic's Milestone Model for Long-Context and Agentic Work


Frontier Model Releases · Topic guide

Frontier Model Releases: The Race From Language to Action


Related events (8)

7 · arXiv · cs.AI · 26d ago

Geopolitical Bias in LLMs Originates in Post-Training, Not Pre-Training Data

A study testing seven open-weight LLM pairs (base vs. chat models) across seven labs finds that geopolitical bias is introduced during post-training rather than inherited from pre-training data. Six of seven labs showed post-training shifts favoring the developer's home country or region, with Alibaba's Qwen 2.5 showing the most extreme shift (18x increase in China-favourability log-odds). The effect is also language-dependent: Mistral becomes pro-France only under French prompting. The authors argue this implicates alignment and RLHF processes as active shapers of geopolitical perspective, calling for greater transparency and auditing of post-training pipelines.

Evaluation and Benchmarking · Open Weights Progress · Mistral AI · Alibaba · Mistral · +6 more

6 · arXiv · cs.CL · 10d ago

The Shibboleth Effect: Cross-lingual behavioral skew in frontier LLMs under adversarial geopolitical simulation

Researchers introduce the 'Shibboleth Effect' — systematic behavioral differences in LLMs when operating in different languages — and audit six frontier models (GPT-4o, Llama-4, Mistral-Large, Gemini-3.1-Pro, Qwen3.6-Plus, DeepSeek-R1) using a synthetic maritime territorial dispute wargame played in English versus Turkish. Results are heterogeneous: Llama-4 becomes significantly more coercive in Turkish while Gemini-3.1-Pro and DeepSeek-R1 become less so, and GPT-4o shows no detectable shift. The study identifies two candidate buffering mechanisms — chain-of-thought institutional anchoring and multilingual RLHF alignment — with direct implications for deploying LLMs in diplomatic or crisis-management contexts.

Evaluation and Benchmarking · AI Safety Research · DeepSeek V4 · Mistral Large 2 · GPT-4o · +8 more

5 · arXiv · cs.CL · 5d ago

Study finds AI-generated stories rely on superficial cultural markers rather than holistic localization

Researchers propose a method to measure the degree of 'templated' versus 'holistic' cultural localization in AI-generated stories, finding that only 9-17% of vocabulary accounts for cross-national variation and that a shared culturally-agnostic narrative template underlies most outputs. The study evaluates five models across 125 topics and 193 nationalities. A notable finding is that cultural markers associated with 19 countries—mostly in the Global South—are rated as offensive on average, raising concerns about bias and representation in multilingual/multicultural AI content generation.

Evaluation and Benchmarking · AI Safety Research · Characterizing Cultural Localization in AI-Generated Stories

7 · arXiv · cs.LG · 1mo ago

AI-Mediated Communication Can Steer Collective Opinion via LLM Editing Biases

This paper demonstrates empirically that LLMs from multiple model families introduce directional biases when editing human-written texts on contested topics (e.g., nudging toward gun control, against atheism). The authors develop a mathematical opinion-dynamics model showing these biases are amplified through social networks, shifting collective opinion at scale. An audit of X's 'Explain this post' feature finds evidence of pro-life bias in Grok's outputs on abortion content, traced to specific design choices. The paper concludes with implications for EU legislative efforts on AI-mediated communication.

Evaluation and Benchmarking · AI Safety Research · Grok · X (Twitter) · EU AI Act · +5 more

4 · The Batch · 1mo ago

Abeba Birhane on Bias in Web-Scraped Training Datasets

Researcher Abeba Birhane examines how large-scale web-scraped datasets used to train trillion-parameter NLP and vision models propagate bias and antisocial content. The commentary highlights that performance gains in deep neural networks come alongside inherited societal biases from web training data. Two posts from The Batch summarize her work on cleaning up web datasets and the specific mechanisms by which NLP models absorb web-sourced biases.

Evaluation and Benchmarking · AI Safety Research · DeepLearning.AI · Abeba Birhane · The Batch

6 · arXiv · cs.CL · 29d ago

Systematic 14-Day Evaluation of Six AI Chatbots as News Intermediaries Across Languages and Regions

Researchers evaluated six commercial AI chatbots (Gemini 3 Flash/Pro, Grok 4, Claude 4.5 Sonnet, GPT-5, GPT-4o mini) on 2,100 factual questions derived from same-day BBC News reporting across six regional services over 14 days in February 2026. Top systems exceed 90% multiple-choice accuracy on breaking news but lose 11-17% under free-response conditions. Key findings include systematic Hindi-language underperformance (79% vs. 89-91% elsewhere) driven by Anglophone retrieval bias, retrieval failures accounting for over 70% of errors, and dramatic accuracy collapse (to 19-70%) on questions containing subtle false premises. A detection-accuracy paradox is identified: the best false-premise detector does not yield the best adversarial accuracy, suggesting premise detection and answer recovery are partially independent capabilities.

Frontier Model Releases · Evaluation and Benchmarking · Gemini 3.5 Pro · BBC News · GPT-4o mini · +11 more

5 · OpenAI Blog · 1mo ago

OpenAI, Georgetown CSET, and Stanford Internet Observatory Publish LLM Disinformation Misuse Report

OpenAI researchers collaborated with Georgetown University's Center for Security and Emerging Technology (CSET) and Stanford Internet Observatory to produce a report on how large language models could be misused to augment disinformation campaigns. The work draws on an October 2021 workshop with 30 experts across disinformation research, ML, and policy, plus over a year of additional research. The report outlines threat models for LLM-enabled disinformation and proposes a framework for analyzing potential mitigations.

AI Safety Research · Regulatory Developments · large language models · Stanford Internet Observatory · Georgetown University Center for Security and Emerging Technology · +2 more

6 · OpenAI Blog · 1mo ago

Defining and Evaluating Political Bias in LLMs

OpenAI has published a post describing its methodology for evaluating political bias in ChatGPT, introducing new real-world testing approaches aimed at improving objectivity and reducing bias. The piece outlines how OpenAI defines political bias in the context of large language models and the evaluation frameworks it is developing to measure it. The post presents this as a public commitment to systematic bias measurement as a component of responsible deployment.

Evaluation and Benchmarking · AI Safety Research · political bias evaluation · ChatGPT · OpenAI · +1 more