Chi nas dal soch el sent de legn -- Auditing Text Corpora for Lombard

arXiv · cs.CL

Audit of Lombard language corpora reveals pervasive data quality and representational bias problems

Researchers conducted a manual audit of the parallel and monolingual corpora available for Lombard, a low-resource language continuum spoken in northern Italy. The study finds that web-scraped datasets suffer from severe language misidentification, boilerplate text, and non-linguistic noise, so the apparent abundance of data is illusory. In addition, the high-quality data is heavily skewed toward Western Lombard varieties, leaving Eastern varieties underrepresented. The authors argue for variety-aware, community-driven curation over quantity-driven scraping.