arXiv cs.CL (Computation and Language)·27h ago

BamiBERT: New BERT-based language model sets state of the art for Vietnamese NLP

Qualcomm AI Research introduces BamiBERT, a BERT-based encoder pre-trained from scratch on 129GB of Vietnamese text for 20 epochs, supporting up to 2048-token context without requiring external word segmentation. It outperforms PhoBERT, the previous de facto Vietnamese encoder, achieving best scores on 11 of 15 metrics across 8 Vietnamese benchmarks. The model is released publicly on Hugging Face.

Related events (8)

Hugging Face Blog·May 19, 2026

mmBERT: ModernBERT goes Multilingual

Hugging Face introduces mmBERT, a multilingual extension of ModernBERT. The post describes adapting the ModernBERT architecture for multilingual text encoding tasks. This represents an incremental but meaningful expansion of the ModernBERT family to cover non-English languages.

arXiv · cs.CL·May 26, 2026

Forgotten Words: Benchmarking NeoBERT for Dementia Detection in Low-Resource Conversational Filipino and English Speech

This paper presents the first NLP-based dementia detection study for Filipino speech, constructing a parallel bilingual dataset of 4,000 DementiaBank-derived transcripts with manual Filipino translations. Five model families are evaluated across monolingual, zero-shot cross-lingual, and bilingual fine-tuning settings. English-trained BERT degrades sharply on Filipino (Macro-F1 = 0.455), but bilingual fine-tuning recovers performance to Macro-F1 = 0.969–0.973 across all transformer models. The key finding is that multilingual clinical NLP performance is driven by linguistic coverage during training rather than model scale or architecture.

arXiv · cs.CL·Jun 19, 2026

IHUBERT: Persian RoBERTa-base model trained on 45GB semantically deduplicated corpus

Researchers introduce IHUBERT, a 125M-parameter monolingual Persian pretrained language model trained from scratch using the RoBERTa-base architecture on a 45GB curated subset of the Sepahr-Danesh collection (~7-8B tokens). The work features a multi-stage preprocessing pipeline including vector-database-based semantic deduplication for domain-balanced pretraining, and a 139k-vocabulary BPE tokenizer optimized for Persian morphology. IHUBERT is evaluated across seven Persian NLU benchmarks, achieving state-of-the-art results on extractive QA (PQuAD F1 88.35) and NLI (FarsTail Macro-F1 0.835). The paper contributes both a new model and a semantic deduplication methodology applicable to low-resource language pretraining.

Hugging Face Blog·May 19, 2026

Pre-Train BERT with Hugging Face Transformers and Habana Gaudi

This Hugging Face blog post from August 2022 describes how to pre-train a BERT model from scratch using the Hugging Face Transformers library on Habana Gaudi hardware accelerators. It covers the full pipeline including data preparation, tokenizer training, and masked language modeling pretraining. The post serves as both a technical tutorial and a demonstration of Habana Gaudi's viability as an alternative AI training accelerator.

Hugging Face Blog·May 19, 2026

Finally, a Replacement for BERT: Introducing ModernBERT

Hugging Face introduces ModernBERT, a modernized encoder-only transformer model designed as a successor to BERT. The model incorporates architectural improvements developed since BERT's 2018 release, targeting better performance on downstream NLP tasks. ModernBERT aims to fill the gap for efficient encoder models in retrieval, classification, and other discriminative tasks where decoder-only LLMs are often overkill.

Qwen Research·May 18, 2026

Qwen2.5-LLM: Alibaba releases open-weight language models from 0.5B to 72B

Alibaba's Qwen team releases the Qwen2.5 series of decoder-only dense language models, open-sourcing seven variants spanning 0.5B to 72B parameters. The release targets production use cases in the 10-30B range and mobile deployments at 3B scale. This represents a significant expansion of the open-weights frontier from a Tier 1 Chinese AI lab.

Meta Llama·Jun 10, 2026

Meta releases Llama 4 Maverick 17B-128E multimodal instruct model on Hugging Face

Meta released Llama 4 Maverick, a 17B active parameter model with 128 experts (MoE architecture), as an image-text-to-text instruct model on Hugging Face. The model supports multimodal inputs and multiple languages including Arabic, German, and English. With 28K+ downloads and 493 likes shortly after release, it is seeing significant early adoption.

arXiv · cs.CL·Jun 24, 2026

L3Cube-MahaPOS: Gold-standard POS tagging dataset and BERT models for Marathi

Researchers introduce L3Cube-MahaPOS, a manually annotated part-of-speech tagging dataset for Marathi comprising 32,354 sentences drawn from news text, using a 16-tag Universal Dependencies-aligned scheme. The work benchmarks six model families including HMM, CRF, BiLSTM variants, MuRIL, and the Marathi-specific transformer MahaBERT-v2, with the best system achieving 88.67% token-level accuracy and 81.67% macro-F1. The dataset, annotation guidelines, and model checkpoints are released publicly to support further research in a severely under-resourced language spoken by over 83 million people.
