L3Cube-MahaPOS: Gold-standard POS tagging dataset and BERT models for Marathi
Researchers introduce L3Cube-MahaPOS, a manually annotated part-of-speech tagging dataset for Marathi comprising 32,354 sentences drawn from news text, using a 16-tag Universal Dependencies-aligned scheme. The work benchmarks six model families including HMM, CRF, BiLSTM variants, MuRIL, and the Marathi-specific transformer MahaBERT-v2, with the best system achieving 88.67% token-level accuracy and 81.67% macro-F1. The dataset, annotation guidelines, and model checkpoints are released publicly to support further research in a severely under-resourced language spoken by over 83 million people.
Related guides (1)
Related events (8)
IHUBERT: Persian RoBERTa-base model trained on 45GB semantically deduplicated corpus
Researchers introduce IHUBERT, a 125M-parameter monolingual Persian pretrained language model trained from scratch using the RoBERTa-base architecture on a 45GB curated subset of the Sepahr-Danesh collection (~7-8B tokens). The work features a multi-stage preprocessing pipeline including vector-database-based semantic deduplication for domain-balanced pretraining, and a 139k-vocabulary BPE tokenizer optimized for Persian morphology. IHUBERT is evaluated across seven Persian NLU benchmarks, achieving state-of-the-art results on extractive QA (PQuAD F1 88.35) and NLI (FarsTail Macro-F1 0.835). The paper contributes both a new model and a semantic deduplication methodology applicable to low-resource language pretraining.
Sentence-Level Clinical Provenance Categorization for Multidisciplinary Hospital Summarization Using Fine-Tuned Llama-3
This pilot study presents a pipeline for categorizing sentence-level clinical provenance across multi-source hospital notes, targeting structured summarization in high-complexity settings like the NICU. The authors fine-tune Llama-3 8B and 70B models on MedSecId (MIMIC-III annotations), achieving Macro F1 above 92% in-domain. Cross-domain evaluation reveals a scale-dependent transfer effect: SFT substantially improves the 70B model (+7% Macro F1) but yields only marginal gains for the 8B model. A quantized fine-tuned 70B model outperforms its full-precision baseline while reducing compute, suggesting quantized adaptation is viable for structured clinical NLP tasks.
Manga109-v2026: Revised Benchmark Dataset for Manga OCR and Multimodal Understanding
Researchers revisit the widely-used Manga109 dataset and identify five categories of annotation issues including transcription errors, missing text regions, and under-segmented speech balloons. They construct Manga109-v2026 by combining OCR-based issue detection with manual revision, correcting approximately 29,000 dialogue annotations. The updated dataset is intended to better align with modern OCR and multimodal manga understanding systems while preserving manga-specific expressive structures.
Forgotten Words: Benchmarking NeoBERT for Dementia Detection in Low-Resource Conversational Filipino and English Speech
This paper presents the first NLP-based dementia detection study for Filipino speech, constructing a parallel bilingual dataset of 4,000 DementiaBank-derived transcripts with manual Filipino translations. Five model families are evaluated across monolingual, zero-shot cross-lingual, and bilingual fine-tuning settings. English-trained BERT degrades sharply on Filipino (Macro-F1 = 0.455), but bilingual fine-tuning recovers performance to Macro-F1 = 0.969–0.973 across all transformer models. The key finding is that multilingual clinical NLP performance is driven by linguistic coverage during training rather than model scale or architecture.
Meta releases Llama Prompt Guard 2 (22M) safety classifier on Hugging Face
Meta released Llama Prompt Guard 2-22M, a lightweight 22-million-parameter text classification model for prompt safety, published on Hugging Face under the meta-llama organization. The model is based on DeBERTa-v2 architecture and tagged for safety use cases including prompt injection and jailbreak detection. It is part of the Llama 4 safety tooling ecosystem and supports English and French.
Automated ICD Classification of Psychiatric Diagnoses Using NLP and LLMs
This study evaluates NLP and ML approaches for automating the mapping of free-text psychiatric descriptions to ICD diagnostic codes, using a dataset of 145,513 Spanish clinical records. Methods range from classical BoW/TF-IDF representations to transformer-based embeddings including e5_large, BioLORD, and Llama-3-8B. Fine-tuned e5_large achieved the best performance with a micro-F1 of 0.866, outperforming classical methods by capturing semantic nuance and medical terminology. The work highlights challenges of long-tail label distributions and ambiguity specific to psychiatric clinical language.
Mistral Saba: 24B Regional Language Model for Middle East and South Asia
Mistral AI has released Mistral Saba, a 24B parameter model specialized for Arabic and South Asian languages, with particular strength in South Indian languages such as Tamil. The model is trained on curated datasets from the Middle East and South Asia, and claims to outperform models more than 5x its size on regional tasks while running on single-GPU systems at over 150 tokens/second. It is available via API and for local on-premises deployment, targeting enterprise use cases in conversational support, domain-specific expertise, and cultural content creation. Mistral also announced a custom private model training offering for strategic enterprise customers.
Meta releases Llama 3.2 11B Vision Instruct multimodal model
Meta released Llama 3.2 11B Vision Instruct on Hugging Face, an open-weights multimodal model supporting image-text-to-text tasks. The model is part of the Llama 3.2 family and supports English and German. With over 157K downloads and 1,600 likes, it has seen substantial community adoption.
