Urdu Katib Handwritten Dataset: Historical Urdu HTR benchmark with CRNN baseline evaluation
Researchers introduce the Urdu Katib Handwritten Dataset (UKHD), the first offline dataset of Urdu handwritten text lines drawn from historical Katib (scribe) materials in the Nastalique calligraphic style. The paper evaluates several CRNN-based hybrid architectures for Urdu Handwritten Text Recognition (HTR) and finds that a CNN-BGRU-CTC model achieves the lowest character and word error rates. The work addresses a recognized gap in cursive-script HTR research: the scarcity of benchmark datasets for Urdu.
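For readers unfamiliar with the two metrics, here is a minimal sketch of how character and word error rates are commonly computed, assuming the usual Levenshtein-based definitions. The paper's own scoring script and text normalization are not described in this summary, so the function names and the whitespace tokenization below are illustrative assumptions, not the authors' implementation.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: minimum number of substitutions,
    insertions, and deletions needed to turn ref into hyp."""
    prev = list(range(len(hyp) + 1))  # distances from the empty ref prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty hyp prefix
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character edits / reference length."""
    return edit_distance(reference, hypothesis) / max(len(reference), 1)


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate over whitespace-separated tokens (an assumption)."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    return edit_distance(ref_words, hyp_words) / max(len(ref_words), 1)


if __name__ == "__main__":
    print(cer("abcd", "abxd"))      # one substitution over four characters
    print(wer("a b c d", "a b d"))  # one deletion over four words
```

Here `len()` counts Unicode code points, so for a script such as Urdu the resulting character counts depend on how the text is normalized before scoring. Lower values are better for both metrics.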
Related events (8)
IHUBERT: Persian RoBERTa-base model trained on 45GB semantically deduplicated corpus
Researchers introduce IHUBERT, a 125M-parameter monolingual Persian pretrained language model trained from scratch using the RoBERTa-base architecture on a 45GB curated subset of the Sepahr-Danesh collection (~7-8B tokens). The work features a multi-stage preprocessing pipeline including vector-database-based semantic deduplication for domain-balanced pretraining, and a 139k-vocabulary BPE tokenizer optimized for Persian morphology. IHUBERT is evaluated across seven Persian NLU benchmarks, achieving state-of-the-art results on extractive QA (PQuAD F1 88.35) and NLI (FarsTail Macro-F1 0.835). The paper contributes both a new model and a semantic deduplication methodology applicable to low-resource language pretraining.
Tatoxa: State-of-the-art text detoxification system for the low-resource Tatar language
Researchers introduce Tatoxa, a text detoxification system for the Tatar language, along with a new fine-tuning and evaluation dataset for this low-resource setting. Comparative experiments show Tatoxa outperforms both open-source and proprietary LLMs on quality metrics. Cross-lingual transfer experiments find that even culturally close Russian data transfers poorly compared to native Tatar training data, highlighting the limits of cross-lingual approaches for low-resource languages.
CN-NewsTTS Bench: automatic benchmark for Chinese news TTS pronunciation of complex written forms
Researchers introduce CN-NewsTTS Bench v0.1, an open benchmark for evaluating Chinese news text-to-speech systems on challenging written forms such as scores, abbreviations, unit symbols, and mixed-script names, all from raw text without preprocessing aids. The benchmark includes a 200-record dev set, an 800-record public test set, an automatic scorer, and baseline results for seven commercial TTS systems. The best system reaches 0.879 strict accuracy, while several fall below 0.60, revealing meaningful performance gaps on a practically important but underexplored evaluation dimension.
Thaka Wins KSAA-2026 Arabic Speech Diacritization Task with Regularized Fine-Tuning of CATT-Whisper
The Thaka team describes their winning system for Task 2 of the KSAA-2026 Shared Task on Arabic Speech Dictation with Automatic Diacritization, which requires producing fully diacritized Arabic text from speech audio and undiacritized transcripts. Their approach fine-tunes CATT-Whisper, a multimodal model combining a CATT text encoder with a frozen Whisper speech encoder, under severe data constraints (2,327 training samples, no external data). Key techniques include R-Drop consistency regularization, Optuna-optimized hyperparameters with high weight decay, Focal Loss, and Monte Carlo Dropout inference averaging over 200 stochastic forward passes across four checkpoints. The system achieves 23.26% WER on the primary metric, placing first among all participants.
UniCAD: Unified benchmark and multimodal LLM for multi-task CAD learning
Researchers introduce UniCAD, a comprehensive benchmark for multimodal CAD learning covering point-to-CAD reconstruction, text/image-to-CAD generation, and CAD question answering. Alongside the benchmark, they present UniCAD-MLLM, a single end-to-end multimodal large language model that ingests text, images, sketches, and point clouds to perform all these tasks. The system achieves state-of-the-art results on both UniCAD and Fusion360 benchmarks, outperforming task-specific and multi-task baselines. Dataset, code, and pretrained models are to be released.
CANDLE: Lightweight CTC-based Arabic character deduplication for social media text normalization
CANDLE is a lightweight Arabic text normalization system that uses Connectionist Temporal Classification (CTC) to deduplicate informal character elongation without handcrafted rules or morphological analyzers. Evaluated on three benchmarks including social media text, the CTC model achieves 5.37% Sentence Error Rate and is distilled from 6 layers to 2 with minimal performance loss. A key downstream benefit is up to 12.8% reduction in tokenizer fertility across Arabic LLM tokenizers, lowering inference costs and improving context window utilization. Code and models are publicly released.
L3Cube-MahaPOS: Gold-standard POS tagging dataset and BERT models for Marathi
Researchers introduce L3Cube-MahaPOS, a manually annotated part-of-speech tagging dataset for Marathi comprising 32,354 sentences drawn from news text, using a 16-tag Universal Dependencies-aligned scheme. The work benchmarks six model families including HMM, CRF, BiLSTM variants, MuRIL, and the Marathi-specific transformer MahaBERT-v2, with the best system achieving 88.67% token-level accuracy and 81.67% macro-F1. The dataset, annotation guidelines, and model checkpoints are released publicly to support further research in a severely under-resourced language spoken by over 83 million people.
Manga109-v2026: Revised Benchmark Dataset for Manga OCR and Multimodal Understanding
Researchers revisit the widely used Manga109 dataset and identify five categories of annotation issues, including transcription errors, missing text regions, and under-segmented speech balloons. They construct Manga109-v2026 by combining OCR-based issue detection with manual revision, correcting approximately 29,000 dialogue annotations. The updated dataset is intended to better align with modern OCR and multimodal manga understanding systems while preserving manga-specific expressive structures.
