3 · arXiv cs.CL (Computation and Language) · 47h ago

IHUBERT: Persian RoBERTa-base model trained on 45GB semantically deduplicated corpus

Researchers introduce IHUBERT, a 125M-parameter monolingual Persian pretrained language model trained from scratch using the RoBERTa-base architecture on a 45GB curated subset of the Sepahr-Danesh collection (~7-8B tokens). The work features a multi-stage preprocessing pipeline including vector-database-based semantic deduplication for domain-balanced pretraining, and a 139k-vocabulary BPE tokenizer optimized for Persian morphology. IHUBERT is evaluated across seven Persian NLU benchmarks, achieving state-of-the-art results on extractive QA (PQuAD F1 88.35) and NLI (FarsTail Macro-F1 0.835). The paper contributes both a new model and a semantic deduplication methodology applicable to low-resource language pretraining.
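To make the semantic deduplication step concrete, here is a minimal sketch of the general idea: embed each document, compare it against the documents already kept, and drop it if it is too similar to one of them. This is an illustration under stated assumptions, not the paper's pipeline. The summary does not name the embedding model, the vector database, or the similarity threshold, so the `embed` function (character n-gram counts standing in for a real sentence embedding), the brute-force in-memory search, and the `threshold=0.8` default are all hypothetical.

```python
import math
from collections import Counter


def embed(text, n=3):
    """Toy stand-in for a sentence embedding: character n-gram counts.

    Assumption: a real pipeline would use a learned embedding model here.
    """
    t = " ".join(text.lower().split())
    return Counter(t[i:i + n] for i in range(max(len(t) - n + 1, 1)))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(v * b[k] for k, v in a.items() if k in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def semantic_dedup(docs, threshold=0.8):
    """Greedily keep a document only if it is not too similar to any kept one.

    Assumption: the threshold value is illustrative. The linear scan over
    kept vectors stands in for a nearest-neighbour query against a vector
    database.
    """
    kept, kept_vecs = [], []
    for doc in docs:
        vec = embed(doc)
        if all(cosine(vec, kv) < threshold for kv in kept_vecs):
            kept.append(doc)
            kept_vecs.append(vec)
    return kept


docs = [
    "The cat sat on the mat.",
    "The cat sat on the mat!",
    "Completely different sentence about tokenizers.",
]
unique_docs = semantic_dedup(docs)
```

The greedy scan is quadratic in the number of kept documents, which is why a corpus-scale pipeline would replace it with an approximate nearest-neighbour index.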

Evaluation and Benchmarking Sepahr-Danesh RoBERTa PQuAD IHUBERT FarsTail ParsiNLU-RC

Related guides (1)

Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder


Related events (8)

3 · Hugging Face Blog · 1mo ago

Pre-Train BERT with Hugging Face Transformers and Habana Gaudi

This Hugging Face blog post from August 2022 describes how to pre-train a BERT model from scratch using the Hugging Face Transformers library on Habana Gaudi hardware accelerators. It covers the full pipeline including data preparation, tokenizer training, and masked language modeling pretraining. The post serves as both a technical tutorial and a demonstration of Habana Gaudi's viability as an alternative AI training accelerator.

Training Infrastructure Habana Gaudi Hugging Face Transformers Hugging Face +2 more

4 · arXiv · cs.CL · 19d ago

IndicBERT-HPA: Reliability-Oriented Multilingual Orthopedic Decision Support with Selective Verification Deferral

This paper presents a framework for classifying free-text orthopedic clinical notes in English, Hindi, and Punjabi, introducing IndicBERT-HPA, a domain-adaptive encoder augmented with language-aware orthopedic adapter heads. The system is evaluated against multilingual transformers, a DistilBERT baseline, and zero-shot LLMs, with zero-shot LLMs found substantially less effective than task-adapted encoders for closed-set clinical classification. IndicBERT-HPA achieves Macro-F1 of 0.8792 and AUPRC of 0.902 under natural clinical prevalence. A deterministic selective-verification layer combining confidence gating, evidence-consistency checking, and language-risk screening improves accuracy from 71.5% to 84.4% at 72.3% coverage on a 5,000-record held-out set.
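The accuracy-at-coverage trade-off reported above can be illustrated with a minimal sketch of confidence gating alone: the system answers only when its confidence clears a threshold and defers otherwise, so accuracy is measured on the accepted subset while coverage is the fraction accepted. This is a simplified illustration, not the paper's verification layer, which also combines evidence-consistency checking and language-risk screening. The function name `selective_metrics`, the record format, and the toy data are hypothetical.

```python
def selective_metrics(records, threshold):
    """Return (coverage, accuracy) for a confidence-gated classifier.

    records: list of (confidence, is_correct) pairs, one per example.
    Predictions with confidence below the threshold are deferred and
    excluded from the accuracy calculation.
    """
    accepted = [ok for conf, ok in records if conf >= threshold]
    coverage = len(accepted) / len(records) if records else 0.0
    accuracy = sum(accepted) / len(accepted) if accepted else float("nan")
    return coverage, accuracy


# Toy data, invented for illustration only.
records = [(0.95, True), (0.90, True), (0.60, False), (0.55, True), (0.30, False)]

no_gate = selective_metrics(records, threshold=0.0)   # answer everything
gated = selective_metrics(records, threshold=0.8)     # defer low-confidence cases
```

Raising the threshold trades coverage for accuracy on the answered cases, which is the kind of trade-off the reported 72.3% coverage figure reflects.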

Evaluation and Benchmarking Enterprise Deployment Patterns confidence gating language-aware adapter heads IndicBERT-HPA +3 more

6 · arXiv · cs.CL · 2d ago

Sumi: First open 7B uniform diffusion language model pretrained from scratch at scale

Researchers introduce Sumi, a fully open 7B uniform diffusion language model (UDLM) pretrained from scratch on 1.5 trillion tokens, the first UDLM at both large parameter scale and large token budget. Sumi performs competitively with autoregressive models on knowledge, reasoning, and coding benchmarks but underperforms on commonsense tasks, a gap attributed partly to an education-heavy data mixture. Model weights, checkpoints, and the full training recipe, including the data mixture specification, are released publicly. The work fills a gap in the diffusion language model landscape, providing a reference point for studying scaling behavior and generation dynamics in uniform diffusion.

Frontier Model Releases Open Weights Progress Sumi Sumi: Open Uniform Diffusion Language Model from Scratch

5 · Hugging Face Blog · 1mo ago

Introducing Falcon-H1-Arabic: Pushing the Boundaries of Arabic Language AI with Hybrid Architecture

TII UAE (Technology Innovation Institute) has released Falcon-H1-Arabic, a new language model specifically optimized for Arabic language tasks using a hybrid architecture. The model builds on the Falcon-H1 lineage and targets improved Arabic NLP capabilities. This release represents a focused effort to advance Arabic-language AI beyond general multilingual models.

Frontier Model Releases Open Weights Progress Falcon-H1-Arabic Hugging Face Falcon-H1 +1 more

6 · Hugging Face Blog · 1mo ago

Falcon-H1: A Family of Hybrid-Head Language Models Redefining Efficiency and Performance

TII UAE has released Falcon-H1, a new family of hybrid-head language models combining attention and state-space mechanisms to improve efficiency and performance. The models are published on Hugging Face and represent TII's latest iteration in the Falcon series. The hybrid architecture targets better inference economics and competitive benchmark results relative to model size.

Frontier Model Releases Open Weights Progress Hugging Face Hybrid-Head Architecture Falcon-H1 +2 more

4 · arXiv · cs.CL · 11d ago

Corpus-Grounded Feature Diffusion pipeline for automated IEP generation in Traditional Chinese

Researchers propose a low-resource fine-tuning pipeline called Corpus-Grounded Feature Diffusion (CGFD) to automate Individualized Education Program (IEP) drafting from Traditional Chinese parent-teacher interview transcripts. The approach fine-tunes Breeze-7B with QLoRA on 582 synthetically diffused samples and uses schema-constrained decoding at inference time, finding that Grammar-Constrained Decoding is counterproductive under Traditional Chinese token budgets. On a small formal hold-out (n=10), the system achieves BERTScore F1 of 0.779, outperforming zero-shot GPT-5.4, DeepSeek-V3.2, Gemini-3-Flash-Preview, and Llama-4-Maverick baselines while enabling fully local, air-gapped inference. The work addresses a gap in Traditional Chinese special-education NLP and demonstrates a privacy-preserving deployment pattern for sensitive document generation.

Evaluation and Benchmarking Enterprise Deployment Patterns DeepSeek V4 Corpus-Grounded Feature Diffusion Grammar-Constrained Decoding +6 more

4 · arXiv · cs.CL · 25d ago

Forgotten Words: Benchmarking NeoBERT for Dementia Detection in Low-Resource Conversational Filipino and English Speech

This paper presents the first NLP-based dementia detection study for Filipino speech, constructing a parallel bilingual dataset of 4,000 DementiaBank-derived transcripts with manual Filipino translations. Five model families are evaluated across monolingual, zero-shot cross-lingual, and bilingual fine-tuning settings. English-trained BERT degrades sharply on Filipino (Macro-F1 = 0.455), but bilingual fine-tuning recovers performance to Macro-F1 = 0.969–0.973 across all transformer models. The key finding is that multilingual clinical NLP performance is driven by linguistic coverage during training rather than model scale or architecture.

Evaluation and Benchmarking TF-IDF + Logistic Regression NeoBERT DementiaBank +4 more

5 · Hugging Face Blog · 1mo ago

mmBERT: ModernBERT goes Multilingual

Hugging Face introduces mmBERT, a multilingual extension of ModernBERT. The post describes adapting the ModernBERT architecture for multilingual text encoding tasks. This represents an incremental but meaningful expansion of the ModernBERT family to cover non-English languages.

Frontier Model Releases Open Weights Progress mmBERT ModernBERT Hugging Face