5 · arXiv cs.CL (Computation and Language) · 11h ago

Study finds readers prefer human literary translations over LLM-based MT, but cannot reliably distinguish them

A new arXiv paper presents a reader-centered evaluation of AI vs. human literary translation across 15 novels in French, Polish, and Japanese translated into English. Fifteen avid readers compared human translations (HT) to machine translations (MT) from an agentic LLM pipeline. They found MT 'fine' but preferred HT for ease, clarity, and immersiveness, especially at the chunk level (522/772 preferences, about 68%). Critically, readers could not reliably identify which version was human-produced (17/30 correct, about 57%), and automatic metrics, including LLM-as-a-judge, consistently favored MT over HT, diverging from human preference. The authors release LAIT, a dataset with 1K reader comments, 2K judgments, and 7.2K span-level annotations.
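To put the two counts in the summary above in context, here is a minimal sketch of an exact one-sided binomial test against coin-flip guessing. It is an illustration only, not the paper's analysis: it assumes every judgment is an independent binary choice and that 522 of the 772 chunk-level preferences favoured HT. Repeated judgments from the same 15 readers are unlikely to be fully independent, and the helper name `binomial_tail` is hypothetical.

```python
from fractions import Fraction
from math import comb


def binomial_tail(successes: int, trials: int) -> Fraction:
    """Exact P(X >= successes) for X ~ Binomial(trials, 1/2)."""
    favourable = sum(comb(trials, k) for k in range(successes, trials + 1))
    return Fraction(favourable, 2 ** trials)


# Counts as reported in the summary; independence is an assumption.
identification_p = binomial_tail(17, 30)   # correct "which one is human?" guesses
preference_p = binomial_tail(522, 772)     # chunk-level preferences (assumed for HT)

print(f"identification: 17/30 = {17 / 30:.1%}, "
      f"one-sided p = {float(identification_p):.3f}")
print(f"preference: 522/772 = {522 / 772:.1%}, "
      f"one-sided p = {float(preference_p):.1e}")
```

Under these assumptions, 17 of 30 correct identifications gives a one-sided p-value of about 0.29, which is consistent with guessing, while the chunk-level preference count sits far below any conventional significance threshold.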

Evaluation and Benchmarking LAIT (Literary AI Translation) AI translation of literary texts is "fine", but readers still prefer human translations

Related guides (1)

Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder


Related events (8)

4 · arXiv · cs.CL · 16d ago

Large-scale social media analysis reveals stakeholder conflicts over machine translation priorities

Researchers analyze 79,286 social media posts from Reddit, Facebook, Bluesky, and Mastodon (2019–2025) to compare how four communities—AI developers, professional translators, language learners, and language service providers—discuss machine translation. The study finds significant disagreements and polarized sentiments across groups, with AI researchers framing MT as a technical benchmark problem while non-AI users prioritize quality nuances, trust, reliability, and social concerns. The work argues for redirecting MT research toward community-identified needs rather than benchmark performance alone.

Evaluation and Benchmarking Reddit Beyond Accuracy: Community Perspectives on Machine Translation

3 · arXiv · cs.CL · 41h ago

Framework for measuring users' mental models of machine translation quality in human-AI collaboration

A new arXiv paper introduces a cross-lingual question answering framework to study how users form mental models of speech translation systems, measuring whether users can predict where MT output is likely to be wrong. The study finds that users develop stronger mental models with practice, particularly when they have some source-language knowledge or access to speech transcriptions. Results suggest cross-lingual QA is a viable downstream task for studying human-AI collaboration in translation contexts.

Evaluation and Benchmarking Measuring User's Mental Models of Speech Translation in Human-AI Collaboration

4 · arXiv · cs.CL · 24d ago

Benchmarking Local LLMs for Confidential Translation Workflows

This paper evaluates locally runnable LLMs (via Ollama) for offline, privacy-constrained translation workflows targeting freelance translators and smaller language service providers. The authors expand their Reeve Foundation corpus to include German and Simplified Chinese, then benchmark local models across four language directions against commercial NMTs (DeepL, Baidu), a frontier LLM (GPT-5.2), and professional local NMT systems. Results show substantial performance variation by language direction and model size, with the best local LLMs matching or exceeding local NMT systems and the frontier LLM, though falling short of top commercial NMTs. The study supports the viability of local LLMs for confidentiality-sensitive translation use cases.

Evaluation and Benchmarking Open Weights Progress Ollama GPT-5.2 DeepL +8 more

4 · arXiv · cs.CL · 41h ago

Cross-lingual evaluation framework reveals LLMs redistribute cultural narrative structure while preserving semantic meaning

A new arXiv preprint introduces a multilingual evaluation framework using 414 proverbs across 15 languages to assess whether LLMs preserve culturally grounded meaning when generating narratives. Using four LLMs to produce 13k narratives, the study finds that cross-lingual prompting preserves proverb-level semantic meaning but systematically redistributes agency, social positioning, and narrative structure. Strong inter-model convergence across architectures suggests multilingual LLMs rely on shared semantic abstractions. The authors argue that semantic similarity metrics alone overestimate cultural preservation in multilingual evaluations.

Evaluation and Benchmarking Same Lesson, Different Story: Cross-Lingual Reconstruction of Cultural Narratives in Large Language Models

5 · arXiv · cs.CL · 28d ago

Towards Reliable Multilingual LLMs-as-a-Judge: An Empirical Study

This paper systematically investigates strategies for extending LLM-based automatic evaluation (LLMs-as-a-Judge) to multilingual settings, covering high-, mid-, and low-resource languages (English, Spanish, Basque). The authors compare instruction translation, monolingual vs. multilingual supervision, and model size, finding that fine-tuned smaller models can match proprietary models when in-domain data is available, while zero-shot larger models are preferable out-of-domain. Two meta-evaluation datasets are extended to Spanish and Basque, and all data and code are publicly released.

Evaluation and Benchmarking Basque language LLM-as-a-Judge mJudge +2 more

4 · arXiv · cs.CL · 1mo ago

Moral Semantics Survive Machine Translation: Cross-Lingual Evidence from Moral Foundations Corpora

This paper investigates whether LLM-based machine translation can preserve moral semantic content well enough to enable cross-lingual moral values classification, using Polish as a test case with ~50k annotated social media posts. A four-method validation pipeline (LaBSE embedding similarity, centered kernel alignment (CKA), LLM-as-a-judge, and classifier parity) shows a mean cosine similarity of 0.86 and AUC gaps of only 0.01–0.02 across Moral Foundations categories. The results suggest machine translation is a practical path to extending moral values NLP research to under-resourced languages, with expected generalization to related Slavic languages.

Evaluation and Benchmarking Moral Foundations Theory Centered Kernel Alignment LLM-as-a-Judge +2 more

4 · arXiv · cs.CL · 1mo ago

Study: LLM-Derived Error Highlights and APE Suggestions in MT Post-Editing

Researchers conducted a controlled study with professional English-Dutch (En-Nl) translators, comparing post-editing (PE) workflows augmented with LLM-derived error highlights and automatic post-editing (APE) correction suggestions against regular PE and against highlights derived from quality estimation (QE). No condition produced measurable productivity or quality gains over standard PE. However, APE-derived highlights were preferred over QE-derived highlights, and correction suggestions improved subjective user experience.

Evaluation and Benchmarking Enterprise Deployment Patterns large language models Automatic Post-Editing (APE) Machine Translation (MT) +1 more

5 · arXiv · cs.CL · 28d ago

VLMs May Not Globally Enhance Human Alignment over LLMs During Natural Reading

This paper compares matched LLM and VLM pairs in a text-only setting to isolate the effect of multimodal training history on human-like language processing. Using whole-cortex fMRI and eye-tracking data from natural reading, the authors find that multimodal pretraining does not confer a uniform global advantage in human alignment. However, VLMs show selective advantages when sentences contain stronger visual semantic content, with converging evidence from both neural and behavioral measures. The findings suggest language-internal representations remain the primary driver of human text processing alignment.

Evaluation and Benchmarking Alignment and RLHF large language models human alignment (neural/behavioral) fMRI +4 more