Almanac
3 · arXiv cs.CL (Computation and Language) · 24h ago

CANDLE: Lightweight CTC-based Arabic character deduplication for social media text normalization

CANDLE is a lightweight Arabic text normalization system that uses Connectionist Temporal Classification (CTC) to deduplicate elongated characters in informal text without handcrafted rules or morphological analyzers. Evaluated on three benchmarks, including social media text, the CTC model achieves a 5.37% Sentence Error Rate and is distilled from 6 layers to 2 with minimal performance loss. A key downstream benefit is a reduction of up to 12.8% in tokenizer fertility (the average number of tokens produced per word) across Arabic LLM tokenizers, which lowers inference costs and improves context window utilization. Code and models are publicly released.
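To make the fertility claim concrete, here is a minimal sketch of how tokenizer fertility and its relative reduction could be measured before and after normalization. The fixed-width `toy_subword_tokenize` function is a hypothetical stand-in for a real subword tokenizer, and the Latin-script strings are illustrative only; neither comes from the CANDLE paper.

```python
from typing import Callable, Iterable, List


def toy_subword_tokenize(word: str, piece_len: int = 3) -> List[str]:
    """Hypothetical stand-in for a subword tokenizer: fixed-width chunks."""
    return [word[i:i + piece_len] for i in range(0, len(word), piece_len)]


def fertility(texts: Iterable[str], tokenize: Callable[[str], List[str]]) -> float:
    """Average number of tokens per whitespace-separated word."""
    n_words = 0
    n_tokens = 0
    for text in texts:
        for word in text.split():
            n_words += 1
            n_tokens += len(tokenize(word))
    if n_words == 0:
        raise ValueError("no words to measure")
    return n_tokens / n_words


def relative_reduction(before: float, after: float) -> float:
    """Fractional drop in fertility, e.g. 0.128 for a 12.8% reduction."""
    return (before - after) / before


raw = ["soooooo goooood"]   # elongated, informal spelling (illustrative)
normalized = ["so good"]    # same text after character deduplication

f_raw = fertility(raw, toy_subword_tokenize)
f_norm = fertility(normalized, toy_subword_tokenize)
print(f_raw, f_norm, relative_reduction(f_raw, f_norm))
```

With a real tokenizer, `tokenize` would be replaced by that tokenizer's own word-to-pieces call; the toy chunker exaggerates the effect and should not be read as an estimate of the reported numbers.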

Related events (8)

3 · arXiv · cs.CL · 29d ago

Thaka Wins KSAA-2026 Arabic Speech Diacritization Task with Regularized Fine-Tuning of CATT-Whisper

The Thaka team describes their winning system for Task 2 of the KSAA-2026 Shared Task on Arabic Speech Dictation with Automatic Diacritization, which requires producing fully diacritized Arabic text from speech audio and undiacritized transcripts. Their approach fine-tunes CATT-Whisper, a multimodal model combining a CATT text encoder with a frozen Whisper speech encoder, under severe data constraints (2,327 training samples, no external data). Key techniques include R-Drop consistency regularization, Optuna-optimized hyperparameters with high weight decay, Focal Loss, and Monte Carlo Dropout inference averaging over 200 stochastic forward passes across four checkpoints. The system achieves 23.26% WER on the primary metric, placing first among all participants.
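Two of the techniques named above have compact definitions, sketched here under stated assumptions. `focal_loss` follows the standard per-example formula, FL = -(1 - p_t)^gamma * log(p_t), and `average_predictions` shows only the averaging step applied to the outputs of several stochastic forward passes. The function names, the default `gamma`, and the toy probabilities are illustrative and are not taken from the Thaka system.

```python
import math
from typing import List, Sequence


def focal_loss(probs: Sequence[float], target: int, gamma: float = 2.0) -> float:
    """Focal loss for one example: -(1 - p_t)^gamma * log(p_t)."""
    p_t = probs[target]
    if not 0.0 < p_t <= 1.0:
        raise ValueError("target probability must be in (0, 1]")
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


def average_predictions(passes: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise mean of probability vectors from several stochastic passes."""
    if not passes:
        raise ValueError("need at least one forward pass")
    n = len(passes)
    return [sum(column) / n for column in zip(*passes)]


# With gamma = 0 the focal loss reduces to ordinary cross-entropy.
easy = focal_loss([0.9, 0.1], target=0)             # confident and correct
plain = focal_loss([0.9, 0.1], target=0, gamma=0.0)
print(easy < plain)

# Toy outputs of two dropout-perturbed passes over the same input.
print(average_predictions([[0.2, 0.8], [0.4, 0.6]]))
```

The focusing term down-weights examples the model already classifies confidently, which is one reason it is often paired with small or imbalanced training sets.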

4 · Hugging Face Blog · 1mo ago

Rethinking LLM Evaluation with 3C3H: AraGen Benchmark and Leaderboard

Hugging Face introduces AraGen, a new Arabic-language LLM benchmark and leaderboard built around the 3C3H evaluation framework (Correctness, Completeness, Conciseness, Helpfulness, Harmlessness, Honesty). The benchmark targets a gap in non-English LLM evaluation, specifically for Arabic, using a structured multi-criteria rubric rather than simple accuracy metrics. The leaderboard is hosted on Hugging Face and aims to provide a more holistic assessment of Arabic generative capabilities across frontier and open-weight models.

4 · arXiv · cs.CL · 9d ago

Synthetic data generation method enables small LLMs to match large models on Text-To-Cypher tasks

A new arXiv paper presents an automatic synthetic data generation method for fine-tuning small LLMs on Text-To-Cypher (Text2Cypher) parsing, enabling natural language interfaces to property graph databases. Experiments across major Text-To-Cypher benchmarks show that small fine-tuned models can compete with much larger proprietary models. The approach is positioned as a solution for local deployment scenarios requiring data sovereignty without expensive annotation.

5 · Hugging Face Blog · 1mo ago

Falcon-Arabic: A Breakthrough in Arabic Language Models

TII UAE has released Falcon-Arabic, a language model specifically designed for Arabic. The announcement presents it as a significant advance in Arabic NLP capabilities. Because this item comes from a tier-2 source with minimal body content, it provides no specific technical details about model size, training data, or benchmark performance.

5 · arXiv · cs.AI · 15d ago

CLP: Lightweight collocation-length predictor achieves zero-loss multi-token inference speedup

Researchers propose CLP (Collocation-Length Predictor), a span-level decision layer for accelerating LLM inference via multi-token prediction without quality degradation. The key insight is 'Backbone-as-Architect': the backbone LM head always generates the first token while MTP heads handle only subsequent tokens, eliminating head-backbone competition that causes repetitive outputs in prior methods. CLP uses a single linear layer (~4.6K–7.7K parameters) versus 1M-parameter gate networks in prior work, achieving 1.14x–1.29x speedup on Qwen2.5 models with near-zero repetition ratio. The paper also establishes that shorter prediction horizons improve MTP head accuracy on larger models, offering a scaling-aware design principle.

4 · arXiv · cs.CL · 42h ago

CTC oracle gap anatomy: acoustic scoring saturates, linguistic MBR decoding recovers WER

A new arXiv paper systematically diagnoses why CTC-internal N-best rescoring fails to improve over greedy decoding on LibriSpeech, showing that blank-path proliferation causes a 53% degradation in rank correlation between CTC scores and WER as beam size grows. The authors demonstrate that the bottleneck is linguistic rather than acoustic: minimum Bayes risk (MBR) decoding with RoBERTa pseudo-log-likelihood achieves a 9% relative WER reduction on LibriSpeech test-other and generalizes across two architectures and three domains. The paper also analyzes why minimum word error rate (MWER) sequence-level fine-tuning fails at near-converged checkpoints, attributing the collapse to a vanishingly small training oracle gap.
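As a rough illustration of MBR decoding over an N-best list, the sketch below picks the hypothesis with the lowest expected word-level edit distance to the other hypotheses, weighted by a softmax over per-hypothesis log scores. The hypotheses and scores are made up, and how the paper turns RoBERTa pseudo-log-likelihoods into these weights is not stated in this summary, so `log_scores` is simply an assumed input.

```python
import math
from typing import List, Sequence


def word_edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance over words (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def mbr_decode(hypotheses: Sequence[str], log_scores: Sequence[float]) -> str:
    """Return the hypothesis with minimum expected word edit distance."""
    if not hypotheses or len(hypotheses) != len(log_scores):
        raise ValueError("need one log score per hypothesis")
    # Softmax over log scores gives an approximate posterior over the N-best list.
    top = max(log_scores)
    weights = [math.exp(s - top) for s in log_scores]
    total = sum(weights)
    posteriors = [w / total for w in weights]

    tokenized: List[List[str]] = [h.split() for h in hypotheses]
    best_idx, best_risk = 0, math.inf
    for i, candidate in enumerate(tokenized):
        # Expected risk: each other hypothesis acts as a possible reference.
        risk = sum(p * word_edit_distance(other, candidate)
                   for p, other in zip(posteriors, tokenized))
        if risk < best_risk:
            best_idx, best_risk = i, risk
    return hypotheses[best_idx]


n_best = ["the cat sat", "the cat sad", "the cat sat down", "a cat sat"]
scores = [-1.0, -0.8, -1.2, -1.2]   # hypothetical log scores
print(mbr_decode(n_best, scores))
```

In this toy list the second hypothesis has the highest score, yet MBR selects "the cat sat" because it is closest on average to the rest of the list. This shows how a consensus criterion can differ from simply taking the top-scoring path.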

4 · Hugging Face Blog · 1mo ago

3LM: A Benchmark for Arabic LLMs in STEM and Code

TII UAE has released 3LM, a benchmark designed to evaluate large language models on Arabic-language STEM and coding tasks. The benchmark addresses a gap in multilingual evaluation infrastructure, where Arabic has been underrepresented relative to English and other high-resource languages. It targets both general-purpose and Arabic-specialized LLMs to assess their capabilities in technical domains.

5 · arXiv · cs.CL · 12d ago

Adaptive asymmetric token compression accelerates time series language models up to 7.68×

A new arXiv preprint proposes an adaptive token budgeting framework for time series (TS) language models that compresses TS tokens using frequency-domain structure and progressively prunes prompt tokens across model layers. The authors demonstrate up to 7.68× inference acceleration with performance improvements in 78% of evaluated settings across forecasting, classification, imputation, and anomaly detection tasks. The work is motivated by the observation that TS tokens have uneven spectral contributions and prompt-token influence attenuates with model depth, making uniform token processing wasteful.