4 · arXiv cs.CL (Computation and Language) · 11d ago

Large-scale social media analysis reveals stakeholder conflicts over machine translation priorities

Researchers analyze 79,286 social media posts from Reddit, Facebook, Bluesky, and Mastodon (2019–2025) to compare how four communities—AI developers, professional translators, language learners, and language service providers—discuss machine translation. The study finds significant disagreements and polarized sentiments across groups, with AI researchers framing MT as a technical benchmark problem while non-AI users prioritize quality nuances, trust, reliability, and social concerns. The work argues for redirecting MT research toward community-identified needs rather than benchmark performance alone.

Evaluation and Benchmarking · Reddit · Beyond Accuracy: Community Perspectives on Machine Translation

Related guides (1)

Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder


Related events (8)

4 · arXiv · cs.CL · 29d ago

Moral Semantics Survive Machine Translation: Cross-Lingual Evidence from Moral Foundations Corpora

This paper investigates whether LLM-based machine translation can preserve moral semantic content well enough to enable cross-lingual moral values classification, using Polish as a test case with ~50k annotated social media posts. A four-method validation pipeline (LaBSE embedding similarity, CKA, LLM-as-judge, and classifier parity) shows mean cosine similarity of 0.86 and AUC gaps of only 0.01–0.02 across Moral Foundations categories. The results suggest machine translation is a practical path to extending moral values NLP research to under-resourced languages, with expected generalization to related Slavic languages.

Evaluation and Benchmarking · Moral Foundations Theory · Centered Kernel Alignment · LLM-as-a-Judge · +2 more

5 · arXiv · cs.CL · 29d ago

Whose Voice Counts? Mapping Stakeholder Perspectives on AI Through Public Submissions to the U.S. Government

Researchers analyze public comment letters submitted to the Trump Administration's U.S. AI Action Plan consultation, applying topic modeling and frequency analysis to compare perspectives across stakeholder groups including academia, individuals, and the private sector. The study finds that individual submitters emphasize concerns about AI's societal impacts on daily life, while the final AI Action Plan predominantly reflects private sector priorities around security, policy, and development. A corpus cleaning pipeline is released alongside the findings. The work highlights a representational gap between public concerns and the resulting policy document.

AI Safety Research · Regulatory Developments · Trump Administration · U.S. AI Action Plan · topic modelling · +1 more

5 · arXiv · cs.CL · 5d ago

Study finds AI-generated stories rely on superficial cultural markers rather than holistic localization

Researchers propose a method to measure the degree of 'templated' versus 'holistic' cultural localization in AI-generated stories, finding that only 9-17% of vocabulary accounts for cross-national variation and that a shared culturally-agnostic narrative template underlies most outputs. The study evaluates five models across 125 topics and 193 nationalities. A notable finding is that cultural markers associated with 19 countries—mostly in the Global South—are rated as offensive on average, raising concerns about bias and representation in multilingual/multicultural AI content generation.

Evaluation and Benchmarking · AI Safety Research · Characterizing Cultural Localization in AI-Generated Stories

3 · Hacker News · 8d ago

"Don't You Just Upload It to ChatGPT?" — community discussion on AI adoption expectations

A blog post titled "Don't You Just Upload It to ChatGPT?" drew significant engagement on Hacker News (208 points, 186 comments), suggesting it touches on a familiar tension between non-technical users' expectations of AI tools and practitioners' more nuanced workflows. The body content is not available, but the title implies commentary on the gap between casual AI use and professional or technical deployment. The high engagement suggests the topic resonates with the practitioner community.

Enterprise Deployment Patterns · ChatGPT · OpenAI

7 · arXiv · cs.LG · 1mo ago

AI-Mediated Communication Can Steer Collective Opinion via LLM Editing Biases

This paper demonstrates empirically that LLMs from multiple model families introduce directional biases when editing human-written texts on contested topics (e.g., nudging toward gun control, against atheism). The authors develop a mathematical opinion-dynamics model showing these biases are amplified through social networks, shifting collective opinion at scale. An audit of X's 'Explain this post' feature finds evidence of pro-life bias in Grok's outputs on abortion content, traced to specific design choices. The paper concludes with implications for EU legislative efforts on AI-mediated communication.

Evaluation and Benchmarking · AI Safety Research · Grok · X (Twitter) · EU AI Act · +5 more

4 · arXiv · cs.CL · 19d ago

Benchmarking Local LLMs for Confidential Translation Workflows

This paper evaluates locally runnable LLMs (via Ollama) for offline, privacy-constrained translation workflows targeting freelance translators and smaller language service providers. The authors expand their Reeve Foundation corpus to include German and Simplified Chinese, then benchmark local models across four language directions against commercial NMTs (DeepL, Baidu), a frontier LLM (GPT-5.2), and professional local NMT systems. Results show substantial performance variation by language direction and model size, with the best local LLMs matching or exceeding local NMT systems and the frontier LLM, though falling short of top commercial NMTs. The study supports the viability of local LLMs for confidentiality-sensitive translation use cases.

Evaluation and Benchmarking · Open Weights Progress · Ollama · GPT-5.2 · DeepL · +8 more

5 · OpenAI Blog · 1mo ago

Lessons learned on language model safety and misuse

OpenAI published a post summarizing their evolving thinking on language model safety and misuse in deployed systems. The piece is intended to share lessons with other AI developers facing similar challenges. It covers OpenAI's internal approaches to mitigating harmful outputs and misuse patterns observed in production.

AI Safety Research · Enterprise Deployment Patterns · OpenAI

5 · Hacker News · 23d ago

Disagreement among frontier LLMs on real-world fact-checks

A study examines how frontier large language models diverge in their responses to real-world fact-checking queries, surfacing systematic disagreements across models on factual claims. The work appears to benchmark multiple leading models against a set of verifiable facts, revealing inconsistencies that have implications for reliability and deployment. With 475 HN points and 333 comments, the piece has generated substantial community discussion. The findings are relevant to evaluation methodology, model calibration, and trust in AI-generated factual content.

Frontier Model Releases · Evaluation and Benchmarking · frontier LLMs · lenz.io · Hacker News