CzechDocs

dataset · active · provisional · czechdocs-cdc71cba · 1 event · first seen 47h ago

Aliases: CzechDocs

Recent events (1)

arXiv · cs.CL · 47h ago

CzechDocs: Multiway parallel dataset for format-preserving machine translation of minority languages

CzechDocs is a new multiway parallel dataset of formatted documents (HTML, DOCX, PDF) covering Czech alongside Ukrainian, English, Vietnamese, Russian, and other minority languages used in Czechia. It is designed to evaluate whether machine translation systems preserve document formatting during translation. A validation split and an evaluation toolkit are publicly released; a held-out test split is reserved for a future shared task.