benchmark
COMETKiwi
benchmark · active · provisional
cometkiwi-38a8dd80 · 1 event · first seen 13d ago
Aliases: COMETKiwi
Recent events (1)
KletterMix: High-quality German pretraining corpus built via translation of English data
Researchers introduce KletterMix, a German-language pretraining corpus constructed by translating a state-of-the-art English pretraining dataset while preserving document structure, metadata, and topical diversity. Translation quality is assessed with COMETKiwi, and the corpus is validated through controlled pretraining and annealing ablations against existing German corpora. Models trained on KletterMix show measurable improvements on German-language downstream evaluations, suggesting that carefully curated translated data can meaningfully improve the pretraining data available for non-English languages.