CzechDocs: Multiway parallel dataset for format-preserving machine translation of minority languages
CzechDocs is a new multiway parallel dataset of formatted documents (HTML, DOCX, PDF) covering Czech, English, and minority languages used in Czechia, including Ukrainian, Vietnamese, and Russian. The dataset is designed to evaluate machine translation systems that preserve document formatting during translation. A validation split and evaluation toolkit are publicly released; a held-out test split is reserved for a future shared task.
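To illustrate what "preserving document formatting" means for the HTML case, here is a minimal sketch that translates only the text nodes and re-emits every tag unchanged. It is not the CzechDocs toolkit: the class `TagPreservingTranslator`, the function `translate_html`, and the `translate` callback are hypothetical names, and a real system would also have to handle sentences that are split across inline tags.

```python
from html import escape
from html.parser import HTMLParser


class TagPreservingTranslator(HTMLParser):
    """Re-emit tags as-is and pass only text nodes to a translate callback."""

    def __init__(self, translate):
        super().__init__(convert_charrefs=True)
        self.translate = translate  # hypothetical callback: str -> str
        self.out = []

    def handle_starttag(self, tag, attrs):
        # get_starttag_text() returns the start tag as written in the input.
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        core = data.strip()
        if not core:
            self.out.append(data)  # whitespace-only node, keep untouched
            return
        # Keep the surrounding whitespace so inline spacing survives.
        lead = data[: len(data) - len(data.lstrip())]
        trail = data[len(data.rstrip()):]
        # Character references were decoded by the parser, so re-escape.
        self.out.append(lead + escape(self.translate(core), quote=False) + trail)


def translate_html(markup, translate):
    parser = TagPreservingTranslator(translate)
    parser.feed(markup)
    parser.close()
    return "".join(parser.out)


# Stand-in "translator" for demonstration only: upper-cases the text.
print(translate_html('<p class="x">Dobrý den, <b>světe</b>!</p>', str.upper))
```

Comparing the tag sequence of the input and output is one simple way to check that formatting survived; how the released toolkit actually scores this is not stated in the summary.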
Related events (8)
BenCzechMark: A Benchmark for Evaluating LLM Czech Language Understanding
BenCzechMark is a new evaluation benchmark designed to assess large language model performance on Czech language tasks. The benchmark addresses the gap in non-English language evaluation, providing a structured way to measure LLM capabilities in Czech across multiple task types. Published on Hugging Face, it contributes to the growing ecosystem of multilingual and language-specific benchmarks.
Docmatix: A Large-Scale Dataset for Document Visual Question Answering
Hugging Face released Docmatix, a large-scale dataset designed for Document Visual Question Answering (DocVQA) tasks. The dataset aims to address the scarcity of high-quality training data for document understanding in multimodal models. It is intended to improve fine-tuning of vision-language models on document comprehension tasks.
First Komi-Yazva–Russian parallel corpus and LLM translation evaluation protocol for endangered low-resource language
Researchers introduce the first Komi-Yazva–Russian parallel corpus of 457 aligned sentence pairs from 74 narrative texts, paired with a rigorous evaluation protocol for studying LLM translation under extreme data scarcity. The protocol includes story-level cross-validation, deterministic retrieval-based few-shot prompting, and both reference-based and judge-based metrics to ensure leakage-aware, reproducible evaluation. Results show LLMs produce non-trivial translations but performance varies strongly by model family; retrieval-based few-shot prompting consistently outperforms zero-shot, though gains plateau quickly. The work frames the corpus as both a dataset contribution and a reproducible testbed for endangered-language machine translation research.
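The two leakage-related ideas in this protocol can be sketched in a few lines. The first function builds story-level folds, so sentences from one narrative never appear on both the training and test side. The second retrieves few-shot examples deterministically from the training pool. The summary does not say which retriever the authors use, so the token-overlap scoring here is an assumption, as are the function names `story_level_folds` and `retrieve_examples`.

```python
from collections import defaultdict


def story_level_folds(story_ids, n_folds):
    """Yield (train_idx, test_idx) pairs where no story spans both sides."""
    by_story = defaultdict(list)
    for idx, sid in enumerate(story_ids):
        by_story[sid].append(idx)
    # Deterministic order: largest stories first, ties broken by story id.
    stories = sorted(by_story, key=lambda s: (-len(by_story[s]), s))
    folds = [[] for _ in range(n_folds)]
    for sid in stories:
        # Greedily place each story into the currently smallest fold.
        smallest = min(range(n_folds), key=lambda f: len(folds[f]))
        folds[smallest].extend(by_story[sid])
    for fold in folds:
        held_out = set(fold)
        train = [i for i in range(len(story_ids)) if i not in held_out]
        yield train, sorted(fold)


def retrieve_examples(query, pool, k):
    """Pick k (source, target) pairs from pool by token overlap with query.

    Assumption: Jaccard overlap on whitespace tokens. Ties are broken by
    pool position, so the same inputs always give the same prompt.
    """
    q = set(query.lower().split())

    def score(pair):
        toks = set(pair[0].lower().split())
        union = q | toks
        return len(q & toks) / len(union) if union else 0.0

    ranked = sorted(enumerate(pool), key=lambda it: (-score(it[1]), it[0]))
    return [pair for _, pair in ranked[:k]]


story_ids = ["a", "a", "b", "b", "b", "c"]
for train, test in story_level_folds(story_ids, n_folds=2):
    print("train", train, "test", test)
```

In use, the retrieval pool for a test sentence would be restricted to the `train` indices of the current fold, which keeps held-out stories out of the prompt.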
SkMTEB: First comprehensive MTEB-style text embedding benchmark for Slovak with adapted E5 models
Researchers introduce SkMTEB, the first MTEB-style embedding benchmark for Slovak, covering 31 datasets across 7 task types, roughly 4× the coverage that existing multilingual benchmarks provide for the language. Evaluation of 31 embedding models shows that large instruction-tuned multilingual models outperform Slovak-specific NLU models on embedding tasks. The authors also release e5-sk-small (45M) and e5-sk-large (365M), derived from Multilingual E5 via vocabulary trimming and fine-tuning; these models are competitive with proprietary APIs at up to 62% smaller size.
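The summary does not describe how the authors trim the vocabulary. In its common form, the technique keeps only the tokens observed in a target-language corpus and drops the matching rows of the embedding matrix, which is where most of the size saving in a multilingual model comes from. The sketch below shows that general idea on toy data; `trim_vocabulary` and all values are illustrative assumptions, not the e5-sk procedure.

```python
def trim_vocabulary(vocab, embeddings, corpus_token_ids, always_keep=()):
    """Keep only tokens seen in the corpus, plus any required special tokens.

    vocab: dict mapping token -> old id
    embeddings: list of embedding rows indexed by old id
    corpus_token_ids: iterable of old ids observed in the target-language corpus
    always_keep: tokens to retain even if unseen (e.g. padding)
    """
    keep = set(corpus_token_ids) | {vocab[tok] for tok in always_keep}
    old_ids = sorted(keep)
    # Map each surviving old id to a new, contiguous id.
    remap = {old: new for new, old in enumerate(old_ids)}
    new_vocab = {tok: remap[i] for tok, i in vocab.items() if i in remap}
    new_embeddings = [embeddings[i] for i in old_ids]
    return new_vocab, new_embeddings, remap


# Toy example: a 4-token vocabulary with 1-dimensional embeddings.
vocab = {"<pad>": 0, "the": 1, "dom": 2, "mesto": 3}
embeddings = [[0.0], [0.1], [0.2], [0.3]]
new_vocab, new_embeddings, remap = trim_vocabulary(
    vocab, embeddings, corpus_token_ids=[3, 2, 3], always_keep=("<pad>",)
)
print(new_vocab, new_embeddings)
```

After trimming, the tokenizer has to be rebuilt with the new ids, and fine-tuning (as the summary mentions) can then recover any quality lost in the process.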
CUNI submits 1B-parameter simultaneous speech translation system to IWSLT 2026
Researchers from CUNI submit a simultaneous speech translation system to the IWSLT 2026 shared task, built on the offline Canary model with the AlignAtt policy. The system covers Czech-English and English-German/Italian translation pairs, supports 25 source and 25 target languages, and outperforms similarly sized baselines in both low- and high-latency regimes. At 1B parameters, it is positioned as a compact, multilingual, computationally efficient solution.
TABVERSE benchmark isolates table representation effects across formats in LLMs and VLMs
TABVERSE is a new controlled multimodal benchmark that evaluates LLMs and VLMs on table understanding by holding table content fixed while varying representation format (HTML, Markdown, LaTeX, rendered images). Evaluation across three tasks—Question Answering, Structural Understanding, and Structure Reconstruction—shows that representation choice substantially affects performance, with structured text generally outperforming rendered images and HTML being the most robust text format. The benchmark addresses a gap in existing evaluations where content, format, and modality vary simultaneously, making it impossible to isolate representation effects.
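The controlled setup described above, one table with several representations, can be illustrated with a small sketch that serializes the same header and rows into the three text formats named in the summary. The function names are hypothetical and this is not the TABVERSE code. Rendered images are left out because they would need an external rendering library, and the Markdown and LaTeX serializers assume cells contain no characters that need escaping.

```python
from html import escape


def to_markdown(header, rows):
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in rows]
    return "\n".join(lines)


def to_html(header, rows):
    head = "".join(f"<th>{escape(cell)}</th>" for cell in header)
    body = "".join(
        "<tr>" + "".join(f"<td>{escape(cell)}</td>" for cell in row) + "</tr>"
        for row in rows
    )
    return f"<table><tr>{head}</tr>{body}</table>"


def to_latex(header, rows):
    spec = "l" * len(header)  # one left-aligned column per header cell
    lines = [f"\\begin{{tabular}}{{{spec}}}", " & ".join(header) + " \\\\", "\\hline"]
    lines += [" & ".join(row) + " \\\\" for row in rows]
    lines.append("\\end{tabular}")
    return "\n".join(lines)


RENDERERS = {"markdown": to_markdown, "html": to_html, "latex": to_latex}

# Same content, three representations: only the format varies.
header = ["City", "Population"]
rows = [["Prague", "1.3M"], ["Brno", "0.4M"]]
for name, render in RENDERERS.items():
    print(f"--- {name} ---")
    print(render(header, rows))
```

Feeding each rendering to a model with an identical question is what lets a benchmark of this kind attribute score differences to the representation rather than to the content.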
Towards Reliable Multilingual LLMs-as-a-Judge: An Empirical Study
This paper systematically investigates strategies for extending LLM-based automatic evaluation (LLMs-as-a-Judge) to multilingual settings, covering high-, mid-, and low-resource languages (English, Spanish, Basque). The authors compare instruction translation, monolingual vs. multilingual supervision, and model size, finding that fine-tuned smaller models can match proprietary models when in-domain data is available, while zero-shot larger models are preferable out-of-domain. Two meta-evaluation datasets are extended to Spanish and Basque, and all data and code are publicly released.
Fifth Shared Task on Multilingual Coreference Resolution: Long-Range Entities and LLM Participation
The fifth CODI-CRAC shared task on multilingual coreference resolution expanded its scope with five new datasets and two additional languages. It builds on CorefUD 1.4, which covers 27 datasets across 19 languages. The 2026 edition emphasized long-range coreference chains spanning many words and sentences. Ten systems participated, including four LLM-based approaches; traditional systems still led, but the LLM-based systems showed notable potential, suggesting that parity may be near.
