Qualcomm AI Research introduces BamiBERT, a BERT-based encoder pre-trained from scratch on 129GB of Vietnamese text for 20 epochs. It supports contexts of up to 2048 tokens and does not require external word segmentation. It outperforms PhoBERT, the previous de facto Vietnamese encoder, achieving the best scores on 11 of 15 metrics across 8 Vietnamese benchmarks. The model is released publicly on Hugging Face.
Hugging Face introduces mmBERT, a multilingual extension of ModernBERT. The post describes adapting the ModernBERT architecture for multilingual text encoding tasks. This represents an incremental but meaningful expansion of the ModernBERT family to cover non-English languages.
This paper presents the first NLP-based dementia detection study for Filipino speech, constructing a parallel bilingual dataset of 4,000 DementiaBank-derived transcripts with manual Filipino translations. Five model families are evaluated across monolingual, zero-shot cross-lingual, and bilingual fine-tuning settings. English-trained BERT degrades sharply on Filipino (Macro-F1 = 0.455), but bilingual fine-tuning recovers performance to Macro-F1 = 0.969–0.973 across all transformer models. The key finding is that multilingual clinical NLP performance is driven by linguistic coverage during training rather than model scale or architecture.
Researchers introduce IHUBERT, a 125M-parameter monolingual Persian pretrained language model trained from scratch using the RoBERTa-base architecture on a 45GB curated subset of the Sepahr-Danesh collection (~7-8B tokens). The work features a multi-stage preprocessing pipeline including vector-database-based semantic deduplication for domain-balanced pretraining, and a 139k-vocabulary BPE tokenizer optimized for Persian morphology. IHUBERT is evaluated across seven Persian NLU benchmarks, achieving state-of-the-art results on extractive QA (PQuAD F1 88.35) and NLI (FarsTail Macro-F1 0.835). The paper contributes both a new model and a semantic deduplication methodology applicable to low-resource language pretraining.
This Hugging Face blog post from August 2022 describes how to pre-train a BERT model from scratch using the Hugging Face Transformers library on Habana Gaudi hardware accelerators. It covers the full pipeline including data preparation, tokenizer training, and masked language modeling pretraining. The post serves as both a technical tutorial and a demonstration of Habana Gaudi's viability as an alternative AI training accelerator.
Hugging Face introduces ModernBERT, a modernized encoder-only transformer model designed as a successor to BERT. The model incorporates architectural improvements developed since BERT's 2018 release, targeting better performance on downstream NLP tasks. ModernBERT aims to fill the gap for efficient encoder models in retrieval, classification, and other discriminative tasks where decoder-only LLMs are often overkill.
Alibaba's Qwen team releases the Qwen2.5 series of decoder-only dense language models, open-sourcing seven variants spanning 0.5B to 72B parameters. The release targets production use cases in the 10-30B range and mobile deployments at 3B scale. This represents a significant expansion of the open-weights frontier from a Tier 1 Chinese AI lab.
Meta released Llama 4 Maverick, a mixture-of-experts (MoE) model with 17B active parameters and 128 experts, as an image-text-to-text instruct model on Hugging Face. The model supports multimodal inputs and multiple languages including Arabic, German, and English. With 28K+ downloads and 493 likes shortly after release, it is seeing significant early adoption.
Researchers introduce L3Cube-MahaPOS, a manually annotated part-of-speech tagging dataset for Marathi comprising 32,354 sentences drawn from news text, using a 16-tag Universal Dependencies-aligned scheme. The work benchmarks six model families including HMM, CRF, BiLSTM variants, MuRIL, and the Marathi-specific transformer MahaBERT-v2, with the best system achieving 88.67% token-level accuracy and 81.67% macro-F1. The dataset, annotation guidelines, and model checkpoints are released publicly to support further research in a severely under-resourced language spoken by over 83 million people.
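Several items above report macro-F1, and the MahaPOS entry reports it alongside token-level accuracy. The minimal sketch below shows why the two can diverge: macro-F1 averages per-class F1 with equal weight, so errors on rare tags cost more than they do in accuracy. The tag sequences are invented toy data, not drawn from any of the datasets mentioned, and the helper names are hypothetical.

```python
from collections import Counter


def accuracy(gold, pred):
    """Fraction of positions where the predicted label equals the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over every label seen in gold or pred."""
    labels = sorted(set(gold) | set(pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label p where it was wrong
            fn[g] += 1  # missed the gold label g
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy example (assumed data): a frequent tag and a rare tag.
gold = ["NOUN"] * 8 + ["ADJ"] * 2
pred = ["NOUN"] * 8 + ["NOUN", "ADJ"]

print(f"accuracy = {accuracy(gold, pred):.3f}")  # 0.900
print(f"macro-F1 = {macro_f1(gold, pred):.3f}")  # 0.804
```

One wrong prediction on the rare tag leaves accuracy at 0.900 but pulls macro-F1 down to about 0.804, the same direction of gap seen between the accuracy and macro-F1 figures quoted for the tagging benchmark.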