3 · arXiv cs.CL (Computation and Language) · 40h ago

Framework for measuring users' mental models of machine translation quality in human-AI collaboration

A new arXiv paper introduces a cross-lingual question answering framework to study how users form mental models of speech translation systems, measuring whether users can predict where MT output is likely to be wrong. The study finds that users develop stronger mental models with practice, particularly when they have some source-language knowledge or access to speech transcriptions. Results suggest cross-lingual QA is a viable downstream task for studying human-AI collaboration in translation contexts.

Evaluation and Benchmarking Measuring User's Mental Models of Speech Translation in Human-AI Collaboration

Related guides (1)

Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Related events (8)

4 · arXiv · cs.CL · 16d ago · source ↗

Large-scale social media analysis reveals stakeholder conflicts over machine translation priorities

Researchers analyze 79,286 social media posts from Reddit, Facebook, Bluesky, and Mastodon (2019–2025) to compare how four communities—AI developers, professional translators, language learners, and language service providers—discuss machine translation. The study finds significant disagreements and polarized sentiments across groups, with AI researchers framing MT as a technical benchmark problem while non-AI users prioritize quality nuances, trust, reliability, and social concerns. The work argues for redirecting MT research toward community-identified needs rather than benchmark performance alone.

Evaluation and Benchmarking Reddit Beyond Accuracy: Community Perspectives on Machine Translation

5 · arXiv · cs.CL · 10h ago · source ↗

Study finds readers prefer human literary translations over LLM-based MT, but cannot reliably distinguish them

A new arXiv paper presents a reader-centered evaluation of AI vs. human literary translation across 15 novels in French, Polish, and Japanese translated into English. Fifteen avid readers compared human translations (HT) to machine translations (MT) from an agentic LLM pipeline, finding MT 'fine' but preferring HT for ease, clarity, and immersiveness—especially at the chunk level (522/772 preferences). Critically, readers could not reliably identify which version was human-produced (17/30 correct), and automatic metrics including LLM-as-a-judge consistently favored MT over HT, diverging from human preference. The authors release LAIT, a dataset with 1K reader comments, 2K judgments, and 7.2K span-level annotations.

Evaluation and Benchmarking LAIT (Literary AI Translation) AI translation of literary texts is "fine", but readers still prefer human translations
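
To put the two counts in the summary above in perspective, here is a minimal sketch of an exact two-sided binomial test against a 50% chance rate. It assumes that 522/772 counts chunk-level preferences for the human translation and that each judgment is independent. Neither assumption is confirmed by the summary, and the paper may use a different test. The helper name `binom_two_sided_p` is hypothetical.

```python
from math import comb


def binom_two_sided_p(successes: int, trials: int, p: float = 0.5) -> float:
    """Exact two-sided binomial p-value.

    Sums the probability of every outcome that is no more likely than the
    observed one under the null hypothesis (success rate p).
    """
    pmf = [
        comb(trials, k) * p**k * (1 - p) ** (trials - k)
        for k in range(trials + 1)
    ]
    observed = pmf[successes]
    # Small relative tolerance so ties are not lost to float rounding.
    total = sum(q for q in pmf if q <= observed * (1 + 1e-9))
    return min(1.0, total)


# Counts taken from the summary; their interpretation is an assumption.
pref_rate = 522 / 772    # share of chunk-level preferences (assumed: for HT)
ident_rate = 17 / 30     # share of correct human-vs-machine identifications

p_pref = binom_two_sided_p(522, 772)
p_ident = binom_two_sided_p(17, 30)

print(f"preference rate {pref_rate:.3f}, p = {p_pref:.2e}")
print(f"identification rate {ident_rate:.3f}, p = {p_ident:.3f}")
```

Under these assumptions the identification result (17/30, about 57%) gives a p-value of roughly 0.58, which is consistent with guessing, while the chunk-level preference share (about 68%) is far from a coin flip.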

6 · OpenAI Blog · 1mo ago · source ↗

TruthfulQA: Measuring how models mimic human falsehoods

OpenAI introduced TruthfulQA, a benchmark designed to measure whether language models generate truthful answers or mimic common human misconceptions and falsehoods. The benchmark tests models on questions where humans frequently give wrong answers due to misconceptions, conspiracy theories, or false beliefs. Results showed that larger models were not necessarily more truthful, and in some cases performed worse, highlighting a key alignment challenge.

Evaluation and Benchmarking AI Safety Research TruthfulQA Stephanie Lin Jacob Hilton +3 more

5 · arXiv · cs.CL · 28d ago · source ↗

Towards Reliable Multilingual LLMs-as-a-Judge: An Empirical Study

This paper systematically investigates strategies for extending LLM-based automatic evaluation (LLMs-as-a-Judge) to multilingual settings, covering high-, mid-, and low-resource languages (English, Spanish, Basque). The authors compare instruction translation, monolingual vs. multilingual supervision, and model size, finding that fine-tuned smaller models can match proprietary models when in-domain data is available, while zero-shot larger models are preferable out-of-domain. Two meta-evaluation datasets are extended to Spanish and Basque, and all data and code are publicly released.

Evaluation and Benchmarking Basque language LLM-as-a-Judge mJudge +2 more

6 · arXiv · cs.CL · 15d ago · source ↗

ParaEval framework reduces MCQA benchmark sensitivity to answer phrasing

A new arXiv preprint identifies a systematic flaw in multiple-choice QA benchmarks: log-likelihood scoring conflates surface-form familiarity with actual capability, producing false performance gaps exceeding 2 points between models trained on identical knowledge. The authors propose ParaEval, which queries models with multiple paraphrases per answer option and scores on the most favorable phrasing, reducing the false gap to below 1 point. The effect is confirmed on frontier 70B and 120B open-source models, suggesting widespread benchmark inflation in standard MCQA evaluations.

Evaluation and Benchmarking ParaEval
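
The scoring idea described in the ParaEval summary above can be sketched as follows: each answer option is scored by its best-scoring paraphrase, and the option with the highest such score is chosen. This is only an illustration under stated assumptions. The function `best_paraphrase_choice`, the toy scorer `toy_log_likelihood`, and all score values are hypothetical, and details such as length normalization or how paraphrases are generated are not given in the summary.

```python
from typing import Callable, Dict, List


def best_paraphrase_choice(
    question: str,
    options: Dict[str, List[str]],
    log_likelihood: Callable[[str, str], float],
) -> str:
    """Pick the option whose most favorable paraphrase scores highest."""
    scores = {
        label: max(log_likelihood(question, text) for text in paraphrases)
        for label, paraphrases in options.items()
    }
    return max(scores, key=scores.get)


# Hypothetical stand-in for a model's log-likelihood of an answer string.
# A real setup would query a language model here.
TOY_SCORES = {
    "H2O": -9.0,             # correct, but an unfamiliar surface form
    "water": -2.0,           # same answer, familiar phrasing
    "carbon dioxide": -5.0,
    "CO2": -4.0,
}


def toy_log_likelihood(question: str, answer: str) -> float:
    return TOY_SCORES[answer]


question = "Which substance do plants absorb through their roots?"
options = {"A": ["H2O", "water"], "B": ["carbon dioxide", "CO2"]}

# Baseline: score only the first phrasing of each option.
single = {label: texts[:1] for label, texts in options.items()}
baseline_pick = best_paraphrase_choice(question, single, toy_log_likelihood)

# Paraphrase-aware: score every phrasing and keep the best per option.
paraphrase_pick = best_paraphrase_choice(question, options, toy_log_likelihood)

print(baseline_pick, paraphrase_pick)
```

In this toy setup the single-phrasing baseline picks B because "H2O" happens to score poorly, while the paraphrase-aware variant picks A. That is the kind of surface-form effect the summary describes.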

4 · arXiv · cs.CL · 7d ago · source ↗

Empirical study of LLM medical domain adaptation trade-offs in French QA

Researchers present a systematic comparison of continual pretraining (CPT), supervised fine-tuning (SFT), and their combination for adapting LLMs to French medical question answering. The study spans three model families, multiple sizes, and three initialization types, evaluating both multiple-choice and open-ended QA formats. Key findings: CPT+SFT yields the best MCQA scores but gains over SFT alone are often not statistically significant, making SFT a cost-effective default; for open-ended QA, CPT improves overlap metrics while SFT degrades generation quality. Cross-lingual transfer from French adaptation to English benchmarks is also demonstrated.

Evaluation and Benchmarking Enterprise Deployment Patterns Trade-offs in Medical LLM Adaptation: An Empirical Study in French QA

5 · arXiv · cs.CL · 8d ago · source ↗

Fine-tuning LLMs to passively estimate depression severity from AI mental health conversations

Researchers fine-tune a Qwen3.5-27B model with a regression head to predict PHQ-9 depression severity scores directly from AI mental health app conversation transcripts, eliminating the need for explicit self-report completion. The training set of 6,283 users combines 3,111 ground-truth labels with pseudolabels generated by Claude Opus and iterative intermediate models. On a held-out test of 842 users, the best model achieves MAE=2.6, Pearson r=0.80, and AUC=0.91 at the clinical PHQ-9≥10 threshold, with AUC>0.87 across all severity thresholds. The work demonstrates a passive, continuous symptom-monitoring approach that could reduce response bias in mental health platforms.

Enterprise Deployment Patterns Claude Opus 4.6 Patient Health Questionnaire-9 Qwen3.6-27B +1 more
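
For readers unfamiliar with the three metrics quoted in the summary above, the sketch below shows one common way to compute them from true and predicted PHQ-9 scores: mean absolute error, Pearson correlation, and AUC for detecting scores at or above a threshold. The data are invented toy values, the function names are hypothetical, and the paper's exact evaluation procedure may differ.

```python
from math import sqrt
from typing import Sequence


def mae(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean absolute error between true and predicted scores."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def auc_at_threshold(
    y_true: Sequence[float], y_pred: Sequence[float], threshold: float = 10
) -> float:
    """AUC for separating true scores >= threshold from the rest.

    Computed as the fraction of (positive, negative) pairs in which the
    positive case has the higher predicted score; ties count as half.
    """
    pos = [p for t, p in zip(y_true, y_pred) if t >= threshold]
    neg = [p for t, p in zip(y_true, y_pred) if t < threshold]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))


# Invented toy data, not taken from the paper.
true_scores = [2, 5, 9, 12, 18, 23]
pred_scores = [3.0, 4.0, 11.0, 10.0, 16.0, 21.0]

toy_mae = mae(true_scores, pred_scores)
toy_r = pearson_r(true_scores, pred_scores)
toy_auc = auc_at_threshold(true_scores, pred_scores, threshold=10)

print(f"MAE={toy_mae:.2f}, r={toy_r:.2f}, AUC={toy_auc:.2f}")
```

The pairwise AUC is quadratic in the number of users, which is fine for a test set of a few hundred but would usually be replaced by a rank-based computation at larger scale.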

5 · OpenAI Blog · 1mo ago · source ↗

Why Language Models Hallucinate

OpenAI published research explaining the mechanisms behind language model hallucination. The work connects improved evaluation methods to enhanced AI reliability, honesty, and safety. The body is sparse on technical detail, but the framing positions this as foundational research relevant to alignment and deployment trust.

Evaluation and Benchmarking AI Safety Research hallucination (LLM) OpenAI +1 more