COMETKiwi

benchmark · active · provisional · cometkiwi-38a8dd80 · 1 event · first seen 13d ago

Aliases: COMETKiwi

Recent events (1)

4 · arXiv · cs.CL · 13d ago

KletterMix: High-quality German pretraining corpus built via translation of English data

Researchers introduce KletterMix, a German-language pretraining corpus constructed by translating a state-of-the-art English pretraining dataset while preserving document structure, metadata, and topical diversity. Translation quality is assessed with COMETKiwi, and the corpus is validated through controlled pretraining and annealing ablations against existing German corpora. Models trained on KletterMix show measurable improvements on German-language downstream evaluations, suggesting that carefully curated translated data can strengthen pretraining data for non-English languages.