paper
Chi nas dal soch el sent de legn -- Auditing Text Corpora for Lombard
paper · active · provisional
chi-nas-dal-soch-el-sent-de-legn-auditing-text-corpora-for-lombard-3e3fae02 · 1 event · first seen 12d ago
Aliases: Chi nas dal soch el sent de legn -- Auditing Text Corpora for Lombard
More like this (12)
human-LLM collaborative annotation
The Shibboleth Effect: Auditing the Cross-Lingual Distributional Skew of Large Language Models
EIT-NLP
Reeve Foundation Multilingual Corpus
Beyond Third-Person Audits: Situated Interaction Auditing for User-Centered LLM Bias Research
datasette-llm
SigLIP2
Leveraging Audio-LLMs to Filter Speech-to-Speech Training Data
LLM-augmented clinical NLP pipeline
GLM-OCR
Which Models Are Our Models Built On? Auditing Invisible Dependencies in Modern LLMs
GLM-4-Voice
Recent events (1)
Audit of Lombard language corpora reveals pervasive data quality and representational bias problems
Researchers conducted a manual audit of the parallel and monolingual corpora available for Lombard, a low-resource language continuum spoken in northern Italy. The study finds that web-scraped datasets suffer from severe language misidentification, boilerplate text, and non-linguistic noise, so the apparent abundance of data is illusory. In addition, the high-quality data is heavily skewed toward Western Lombard varieties, leaving Eastern varieties underrepresented. The authors argue for variety-aware, community-driven curation over quantity-driven scraping.