3 · arXiv cs.CL (Computation and Language) · 12d ago

Urdu Katib Handwritten Dataset: Historical Urdu HTR benchmark with CRNN baseline evaluation

Researchers introduce the Urdu Katib Handwritten Dataset (UKHD), the first offline dataset of Urdu handwritten text lines drawn from historical Katib (scribe) materials in the Nastalique calligraphic style. The paper evaluates several CRNN-based hybrid architectures for Urdu Handwritten Text Recognition (HTR) and finds that a CNN-BGRU-CTC model achieves the best character and word error rates. The work addresses a recognized gap in cursive-script HTR research caused by the scarcity of benchmark datasets for Urdu.
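For context on the two metrics named above: character and word error rates are conventionally computed as the Levenshtein edit distance between the recognized text and the reference, divided by the reference length. The sketch below assumes that conventional definition; the summary does not describe the paper's own scoring script, and the function names `edit_distance`, `cer`, and `wer` are illustrative, not taken from the paper.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (unit cost per edit)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete a reference item
                            curr[j - 1] + 1,      # insert a hypothesis item
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: character edits / reference characters."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


def wer(reference, hypothesis):
    """Word error rate: word edits / reference words (whitespace split)."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)
```

Because the denominator is the reference length, both rates can exceed 1.0 when the hypothesis contains many insertions.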

Evaluation and Benchmarking CNN-BGRU-CTC Urdu Katib Handwritten Dataset

Related guides (1)

Evaluation and Benchmarking · Topic guide

AI Evaluation and Benchmarking: From Leaderboards to the Limits of Measurement


Related events (8)

3 · arXiv · cs.CL · 11d ago

IHUBERT: Persian RoBERTa-base model trained on 45GB semantically deduplicated corpus

Researchers introduce IHUBERT, a 125M-parameter monolingual Persian pretrained language model trained from scratch using the RoBERTa-base architecture on a 45GB curated subset of the Sepahr-Danesh collection (~7-8B tokens). The work features a multi-stage preprocessing pipeline including vector-database-based semantic deduplication for domain-balanced pretraining, and a 139k-vocabulary BPE tokenizer optimized for Persian morphology. IHUBERT is evaluated across seven Persian NLU benchmarks, achieving state-of-the-art results on extractive QA (PQuAD F1 88.35) and NLI (FarsTail Macro-F1 0.835). The paper contributes both a new model and a semantic deduplication methodology applicable to low-resource language pretraining.

Evaluation and Benchmarking Sepahr-Danesh RoBERTa PQuAD +3 more

3 · arXiv · cs.CL · 5d ago

Tatoxa: State-of-the-art text detoxification system for the low-resource Tatar language

Researchers introduce Tatoxa, a text detoxification system for the Tatar language, along with a new fine-tuning and evaluation dataset for this low-resource setting. Comparative experiments show Tatoxa outperforms both open-source and proprietary LLMs on quality metrics. Cross-lingual transfer experiments find that even culturally close Russian data transfers poorly compared to native Tatar training data, highlighting the limits of cross-lingual approaches for low-resource languages.

AI Safety Research Tatoxa The Tatoxa System for Text Detoxification in Low-Resource Languages: The Case of Tatar

3 · arXiv · cs.CL · 6d ago

CN-NewsTTS Bench: automatic benchmark for Chinese news TTS pronunciation of complex written forms

Researchers introduce CN-NewsTTS Bench v0.1, an open benchmark for evaluating how Chinese news text-to-speech systems pronounce challenging written forms such as scores, abbreviations, unit symbols, and mixed-script names, all from raw text without preprocessing aids. The benchmark includes a 200-record dev set, an 800-record public test set, an automatic scorer, and baseline results for seven commercial TTS systems. The best system reaches 0.879 strict accuracy while several fall below 0.60, revealing meaningful performance gaps on a practically important but underexplored evaluation dimension.

Evaluation and Benchmarking CN-NewsTTS Bench

3 · arXiv · cs.CL · 1mo ago

Thaka Wins KSAA-2026 Arabic Speech Diacritization Task with Regularized Fine-Tuning of CATT-Whisper

The Thaka team describes their winning system for Task 2 of the KSAA-2026 Shared Task on Arabic Speech Dictation with Automatic Diacritization, which requires producing fully diacritized Arabic text from speech audio and undiacritized transcripts. Their approach fine-tunes CATT-Whisper, a multimodal model combining a CATT text encoder with a frozen Whisper speech encoder, under severe data constraints (2,327 training samples, no external data). Key techniques include R-Drop consistency regularization, Optuna-optimized hyperparameters with high weight decay, Focal Loss, and Monte Carlo Dropout inference averaging over 200 stochastic forward passes across four checkpoints. The system achieves 23.26% WER on the primary metric, placing first among all participants.

Multimodal Progress Optuna Focal Loss CATT +6 more

5 · arXiv · cs.AI · 26d ago

UniCAD: Unified benchmark and multimodal LLM for multi-task CAD learning

Researchers introduce UniCAD, a comprehensive benchmark for multimodal CAD learning covering point-to-CAD reconstruction, text/image-to-CAD generation, and CAD question answering. Alongside the benchmark, they present UniCAD-MLLM, a single end-to-end multimodal large language model that ingests text, images, sketches, and point clouds to perform all of these tasks. The system achieves state-of-the-art results on both the UniCAD and Fusion360 benchmarks, outperforming task-specific and multi-task baselines. The authors state that the dataset, code, and pretrained models will be released.

Evaluation and Benchmarking Multimodal Progress Fusion360 UniCAD-MLLM UniCAD

3 · arXiv · cs.CL · 6d ago

CANDLE: Lightweight CTC-based Arabic character deduplication for social media text normalization

CANDLE is a lightweight Arabic text normalization system that uses Connectionist Temporal Classification (CTC) to deduplicate informal character elongation without handcrafted rules or morphological analyzers. Evaluated on three benchmarks including social media text, the CTC model achieves a 5.37% Sentence Error Rate and is distilled from 6 layers to 2 with minimal performance loss. A key downstream benefit is a reduction of up to 12.8% in tokenizer fertility across Arabic LLM tokenizers, which lowers inference costs and improves context window utilization. Code and models are publicly released.
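As background on why CTC suits deduplication: standard CTC decoding merges consecutive repeated symbols and then drops a special blank symbol, so a repeated character survives only where the model emits a blank between the copies. The sketch below shows only that generic collapse rule on a made-up frame sequence. It is not CANDLE's model or code, and the blank marker `_` and the function name `ctc_collapse` are assumptions chosen for illustration.

```python
BLANK = "_"  # hypothetical blank marker; real systems use a reserved class index


def ctc_collapse(frames, blank=BLANK):
    """Apply the CTC collapse rule: merge adjacent repeats, then drop blanks."""
    out = []
    prev = None
    for symbol in frames:
        # Keep a symbol only when it differs from the previous frame
        # and is not the blank.
        if symbol != prev and symbol != blank:
            out.append(symbol)
        prev = symbol
    return "".join(out)


# Repeated frames collapse to one character; the blank between the two
# runs of "l" is what preserves the genuine double letter.
print(ctc_collapse("hh_e_ll_lll_oo"))  # prints "hello"
```

Under this rule the network, not a handcrafted rule set, decides which repetitions are genuine by placing blanks between them.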

Inference Economics CANDLE Connectionist Temporal Classification abjadai

3 · arXiv · cs.CL · 6d ago

L3Cube-MahaPOS: Gold-standard POS tagging dataset and BERT models for Marathi

Researchers introduce L3Cube-MahaPOS, a manually annotated part-of-speech tagging dataset for Marathi comprising 32,354 sentences drawn from news text, using a 16-tag Universal Dependencies-aligned scheme. The work benchmarks six model families including HMM, CRF, BiLSTM variants, MuRIL, and the Marathi-specific transformer MahaBERT-v2, with the best system achieving 88.67% token-level accuracy and 81.67% macro-F1. The dataset, annotation guidelines, and model checkpoints are released publicly to support further research in a severely under-resourced language spoken by over 83 million people.

Evaluation and Benchmarking MahaBERT-v2 L3Cube L3Cube-MahaPOS +1 more

4 · arXiv · cs.CL · 1mo ago

Manga109-v2026: Revised Benchmark Dataset for Manga OCR and Multimodal Understanding

Researchers revisit the widely used Manga109 dataset and identify five categories of annotation issues, including transcription errors, missing text regions, and under-segmented speech balloons. They construct Manga109-v2026 by combining OCR-based issue detection with manual revision, correcting approximately 29,000 dialogue annotations. The updated dataset is intended to better align with modern OCR and multimodal manga understanding systems while preserving manga-specific expressive structures.

Evaluation and Benchmarking Multimodal Progress Manga109-v2026 Manga109