CzechDocs: Multiway parallel dataset for format-preserving machine translation of minority languages
CzechDocs is a new multiway parallel dataset of formatted documents (HTML, DOCX, PDF) covering Czech and English alongside Ukrainian, Vietnamese, Russian, and other minority languages used in Czechia. It is designed to evaluate machine translation systems that preserve document formatting during translation. A validation split and an evaluation toolkit have been publicly released; a held-out test split is reserved for a future shared task.