3 · arXiv cs.CL (Computation and Language) · 15d ago

Audit of Lombard language corpora reveals pervasive data quality and representational bias problems

Researchers conducted a manual audit of parallel and monolingual corpora available for Lombard, a low-resource language continuum from northern Italy. The study finds that web-scraped datasets suffer from severe language misidentification, boilerplate text, and non-linguistic noise, making apparent data abundance illusory. Additionally, high-quality data is heavily skewed toward Western Lombard varieties, leaving Eastern varieties underrepresented. The authors argue for variety-aware, community-driven curation over quantity-driven scraping.

Evaluation and Benchmarking · Chi nas dal soch el sent de legn -- Auditing Text Corpora for Lombard

Related guides (1)

Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Related events (8)

5 · arXiv · cs.CL · 10d ago

Audit finds cultural translation failures and diversity collapse in LLM-adapted math word problems across 7 languages

Researchers audited how Claude Opus 4, GPT-4.1, and Gemini 2.5 Pro adapt 60 English math word problems into seven languages spanning South Asia and Italy, annotating 6,489 entity transformations. Models agreed on transformation type only 62.5% of the time and on specific substitutions in just 33.5% of cases, meaning model choice substantially shapes the cultural world students encounter. All 21 language-model combinations exhibited 'entropy collapse'—adaptations compressed rather than expanded cultural diversity—and models produced systematic regional misattributions (e.g., Bangladeshi currency for Indian Bengali students) and cross-cultural contamination (e.g., egg hunts framed as Eid activities). The study highlights that surface plausibility masks deeper corpus-level failures invisible in individual translations.

Evaluation and Benchmarking · AI Safety Research · Claude Opus 4.6 · Who Brought Easter Eggs to Eid? Auditing Cultural Translation of Math Word Problems Across Diverse Languages and Regions · Google
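
The summary above does not say how "entropy collapse" was computed. One plausible reading, offered only as a sketch under that assumption, is a drop in the Shannon entropy of the entity distribution after adaptation. The entity lists below are invented for illustration and are not taken from the paper.

```python
import math
from collections import Counter


def shannon_entropy(items):
    """Shannon entropy (in bits) of the empirical distribution over items."""
    counts = Counter(items)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical entities: four distinct items in the source problems,
# mostly one repeated item after adaptation.
source_entities = ["pizza", "taco", "bagel", "sushi"]
adapted_entities = ["samosa", "samosa", "samosa", "roti"]

h_source = shannon_entropy(source_entities)    # uniform over 4 items: 2.0 bits
h_adapted = shannon_entropy(adapted_entities)  # skewed toward one item: lower

# Under this assumed definition, "collapse" means the adapted set is less diverse.
collapsed = h_adapted < h_source
print(f"source={h_source:.3f} bits, adapted={h_adapted:.3f} bits, collapsed={collapsed}")
```

Comparing entropies at the corpus level rather than per problem fits the summary's point that each individual adaptation can look plausible while the overall set narrows.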

4 · arXiv · cs.CL · 25d ago

WhoSaidIt: Human-LLM Collaborative Annotation for Multilingual Speaker-Attribute Classification

This paper proposes a human-LLM collaborative re-annotation framework for stabilizing noisy multilingual speaker-attribute labels under resource constraints. LLMs surface recurring annotation rationales through iterative expert interaction, combined with disagreement-focused sampling for targeted re-annotation. The resulting WhoSaidIt dataset covers nine speaker-attribute labels across multiple languages. Benchmarking of recent LLMs reveals substantial cross-lingual annotation divergence and highlights both capabilities and limitations of LLMs in this classification task.

Evaluation and Benchmarking · Agent and Tool Ecosystem · human-LLM collaborative annotation · speaker-attribute classification · WhoSaidIt

3 · arXiv · cs.CL · 15d ago

First Komi-Yazva–Russian parallel corpus and LLM translation evaluation protocol for endangered low-resource language

Researchers introduce the first Komi-Yazva–Russian parallel corpus of 457 aligned sentence pairs from 74 narrative texts, paired with a rigorous evaluation protocol for studying LLM translation under extreme data scarcity. The protocol includes story-level cross-validation, deterministic retrieval-based few-shot prompting, and both reference-based and judge-based metrics to ensure leakage-aware, reproducible evaluation. Results show LLMs produce non-trivial translations but performance varies strongly by model family; retrieval-based few-shot prompting consistently outperforms zero-shot, though gains plateau quickly. The work frames the corpus as both a dataset contribution and a reproducible testbed for endangered-language machine translation research.

Evaluation and Benchmarking · A Komi-Yazva–Russian Parallel Corpus and Evaluation Protocol for Zero- and Few-Shot LLM Translation · Komi-Yazva–Russian Parallel Corpus
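
Story-level cross-validation, as mentioned in the summary above, keeps all sentence pairs from one narrative in the same fold, so a model is never tested on a story it has seen examples from. The sketch below shows the general idea only. The function name `story_level_folds`, the round-robin assignment, and the toy story IDs are assumptions for illustration, not the authors' protocol.

```python
def story_level_folds(story_ids, n_folds):
    """Yield (train_idx, test_idx) pairs where no story spans both sides.

    story_ids[i] is the story that sentence pair i belongs to.
    """
    stories = sorted(set(story_ids))
    # Deterministic round-robin assignment of whole stories to folds.
    fold_of = {story: i % n_folds for i, story in enumerate(stories)}
    for k in range(n_folds):
        test = [i for i, s in enumerate(story_ids) if fold_of[s] == k]
        train = [i for i, s in enumerate(story_ids) if fold_of[s] != k]
        yield train, test


# Hypothetical toy data: seven sentence pairs drawn from four stories.
story_ids = ["s1", "s1", "s2", "s3", "s3", "s3", "s4"]
folds = list(story_level_folds(story_ids, n_folds=2))

for train, test in folds:
    train_stories = {story_ids[i] for i in train}
    test_stories = {story_ids[i] for i in test}
    print(sorted(train_stories), sorted(test_stories))
```

Splitting by story rather than by sentence is what makes the evaluation leakage-aware: sentences from the same narrative share vocabulary and context, so a sentence-level split would overstate translation quality.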

6 · arXiv · cs.CL · 5d ago

Computational audit finds ClinicalBERT amplifies demographic bias beyond training data distributions

Researchers present a systematic audit of representational bias in ClinicalBERT, a BERT-based model pretrained on MIMIC-III clinical discharge summaries, using two probing methodologies: Log Probability Bias Analysis and Masked Language Model probing across 98 clinical sentence templates and eight intersectional race-gender combinations. Of 32 statistically significant findings, 65.6% contradict observed corpus distributions, rising to 80% for Black patients and 87.5% for agency attribution under MLM probing. The key finding is that bias in ClinicalBERT operates predominantly through model-internal amplification rather than simple inheritance from training data, which has direct implications for clinical AI safety and deployment. This challenges the assumption that auditing training corpora is sufficient to characterize model bias.

Evaluation and Benchmarking · AI Safety Research · A Computational Audit of Demographic Association Encoding in ClinicalBERT Language Predictions · MIMIC-III · ClinicalBERT

6 · arXiv · cs.CL · 12d ago

Study finds local languages provide better cultural knowledge access in LLMs once proficiency is controlled

A new arXiv paper introduces a controlled evaluation framework to disentangle language proficiency from culture-specific knowledge access in LLMs. Using real-world cultural questions across 13 locales and ~80 models, the authors apply item response theory to show that while English dominates on culture-agnostic questions, local languages yield a consistent knowledge-access advantage on culture-specific questions once proficiency differences are factored out. The finding challenges the common interpretation that weaker local-language accuracy implies weaker cultural knowledge, and has implications for how multilingual and regionally-aligned models are evaluated.

Evaluation and Benchmarking · The Masked Advantage: Uncovering Local-Language Access to Cultural Knowledge in LLMs · item response theory

5 · arXiv · cs.CL · 23d ago

Towards Reliable Multilingual LLMs-as-a-Judge: An Empirical Study

This paper systematically investigates strategies for extending LLM-based automatic evaluation (LLMs-as-a-Judge) to multilingual settings, covering high-, mid-, and low-resource languages (English, Spanish, Basque). The authors compare instruction translation, monolingual vs. multilingual supervision, and model size, finding that fine-tuned smaller models can match proprietary models when in-domain data is available, while zero-shot larger models are preferable out-of-domain. Two meta-evaluation datasets are extended to Spanish and Basque, and all data and code are publicly released.

Evaluation and Benchmarking · Basque language · LLM-as-a-Judge · mJudge

5 · Hacker News · 23d ago

Disagreement among frontier LLMs on real-world fact-checks

A study examines how frontier large language models diverge in their responses to real-world fact-checking queries, surfacing systematic disagreements across models on factual claims. The work appears to benchmark multiple leading models against a set of verifiable facts, revealing inconsistencies that have implications for reliability and deployment. With 475 HN points and 333 comments, the piece has generated substantial community discussion. The findings are relevant to evaluation methodology, model calibration, and trust in AI-generated factual content.

Frontier Model Releases · Evaluation and Benchmarking · frontier LLMs · lenz.io · Hacker News

4 · Hugging Face Blog · 1mo ago

FilBench: Benchmarking LLM Capabilities in Filipino Language

FilBench is a new benchmark introduced to evaluate large language models on their ability to understand and generate Filipino. The benchmark targets a historically underrepresented language in NLP evaluation suites, assessing both comprehension and generation tasks. This work addresses gaps in multilingual LLM evaluation coverage, particularly for Southeast Asian languages.

Evaluation and Benchmarking · Multimodal Progress · FilBench · Hugging Face · Filipino