Almanac
dataset

KletterMix

dataset · active · provisional · klettermix-56cf6665 · 1 event · first seen 14d ago

Aliases: KletterMix

Recent events (1)

4 · arXiv · cs.CL · 14d ago

KletterMix: High-quality German pretraining corpus built via translation of English data

Researchers introduce KletterMix, a German-language pretraining corpus constructed by translating a state-of-the-art English pretraining dataset while preserving document structure, metadata, and topical diversity. The corpus is evaluated using COMETKiwi for translation quality and validated through controlled pretraining and annealing ablations against existing German corpora. Models trained on KletterMix show measurable improvements on German-language downstream evaluations, suggesting that carefully curated translated data can meaningfully advance non-English pretraining data ecosystems.
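
A minimal sketch of the two ideas in the summary above, under stated assumptions: translating a document paragraph by paragraph so that structure and metadata survive, and averaging a reference-free quality score over source/translation pairs. The names `Document`, `translate_document`, and `mean_quality` are hypothetical, and the `translate` and `score` callables are placeholders for a real machine translation system and a quality estimator such as COMETKiwi. This is not the KletterMix pipeline, whose details are not given here.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Document:
    """Hypothetical record: body text plus arbitrary metadata."""
    text: str
    metadata: Dict[str, str] = field(default_factory=dict)


def translate_document(doc: Document, translate: Callable[[str], str]) -> Document:
    """Translate paragraph by paragraph, keeping blank-line structure and metadata."""
    # Splitting on blank lines keeps paragraph boundaries intact after translation.
    paragraphs = doc.text.split("\n\n")
    translated = [translate(p) if p.strip() else p for p in paragraphs]
    # Copy the metadata so the source record is never mutated.
    return Document(text="\n\n".join(translated), metadata=dict(doc.metadata))


def mean_quality(pairs: List[Tuple[str, str]],
                 score: Callable[[str, str], float]) -> float:
    """Average a reference-free score over (source, translation) pairs."""
    scores = [score(src, hyp) for src, hyp in pairs]
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    # Toy stand-ins (assumptions): a real setup would call an MT model and a
    # quality-estimation model here.
    toy_translate = lambda s: s.upper()
    toy_score = lambda src, hyp: 1.0 if hyp else 0.0

    source = Document("First paragraph.\n\nSecond paragraph.", {"url": "example"})
    target = translate_document(source, toy_translate)
    print(target.text)
    print(target.metadata)
    print(mean_quality([(source.text, target.text)], toy_score))
```

Passing the translator and scorer in as callables keeps the sketch independent of any particular model or library.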