3 · arXiv cs.CL (Computation and Language) · 25h ago

L3Cube-MahaPOS: Gold-standard POS tagging dataset and BERT models for Marathi

Researchers introduce L3Cube-MahaPOS, a manually annotated part-of-speech tagging dataset for Marathi comprising 32,354 sentences drawn from news text, using a 16-tag Universal Dependencies-aligned scheme. The work benchmarks six model families including HMM, CRF, BiLSTM variants, MuRIL, and the Marathi-specific transformer MahaBERT-v2, with the best system achieving 88.67% token-level accuracy and 81.67% macro-F1. The dataset, annotation guidelines, and model checkpoints are released publicly to support further research in a severely under-resourced language spoken by over 83 million people.
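
The gap between the reported token-level accuracy and macro-F1 is typical for POS tagging, because macro-F1 weights every tag equally and so penalizes mistakes on rare tags more heavily. A minimal sketch of both metrics follows. The toy tag sequences and the function names `token_accuracy` and `macro_f1` are illustrative assumptions, not the paper's evaluation code.

```python
from collections import Counter

def token_accuracy(gold, pred):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    total = sum(len(sent) for sent in gold)
    correct = sum(g == p for gs, ps in zip(gold, pred) for g, p in zip(gs, ps))
    return correct / total

def macro_f1(gold, pred):
    """Unweighted mean of per-tag F1 over every tag seen in gold or pred."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for gs, ps in zip(gold, pred):
        for g, p in zip(gs, ps):
            if g == p:
                tp[g] += 1
            else:
                fp[p] += 1  # predicted tag gets a false positive
                fn[g] += 1  # gold tag gets a false negative
    labels = set(tp) | set(fp) | set(fn)
    scores = []
    for tag in labels:
        denom = 2 * tp[tag] + fp[tag] + fn[tag]
        scores.append(2 * tp[tag] / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Toy example (hypothetical data): one rare ADJ token is mistagged as NOUN.
gold = [["NOUN", "VERB", "NOUN"], ["NOUN", "ADJ", "NOUN", "VERB"]]
pred = [["NOUN", "VERB", "NOUN"], ["NOUN", "NOUN", "NOUN", "VERB"]]

print(f"accuracy = {token_accuracy(gold, pred):.4f}")
print(f"macro-F1 = {macro_f1(gold, pred):.4f}")
```

In this toy case a single error on the only ADJ token leaves accuracy at 6/7 but pulls macro-F1 down to 17/27 (about 0.63), since the ADJ class scores an F1 of zero.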

Evaluation and Benchmarking · MahaBERT-v2 · L3Cube · L3Cube-MahaPOS · MuRIL

Related guides (1)

Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Related events (8)

3 · arXiv · cs.CL · 5d ago

IHUBERT: Persian RoBERTa-base model trained on 45GB semantically deduplicated corpus

Researchers introduce IHUBERT, a 125M-parameter monolingual Persian pretrained language model trained from scratch using the RoBERTa-base architecture on a 45GB curated subset of the Sepahr-Danesh collection (~7-8B tokens). The work features a multi-stage preprocessing pipeline including vector-database-based semantic deduplication for domain-balanced pretraining, and a 139k-vocabulary BPE tokenizer optimized for Persian morphology. IHUBERT is evaluated across seven Persian NLU benchmarks, achieving state-of-the-art results on extractive QA (PQuAD F1 88.35) and NLI (FarsTail Macro-F1 0.835). The paper contributes both a new model and a semantic deduplication methodology applicable to low-resource language pretraining.
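
The summary does not describe how the semantic deduplication step works internally. As a generic illustration of the idea only, the sketch below keeps a document when its embedding is not too similar to any document already kept. The function names, the 0.95 cosine threshold, and the toy 2-D vectors are all assumptions for illustration, not details from the IHUBERT paper.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def semantic_dedup(embeddings, threshold=0.95):
    """Greedy filter: return indices of documents to keep.

    A document is kept only if its similarity to every previously kept
    document is below `threshold` (a hypothetical value).
    """
    kept = []
    for i, vec in enumerate(embeddings):
        if all(cosine(vec, embeddings[j]) < threshold for j in kept):
            kept.append(i)
    return kept

# Toy embeddings (hypothetical): documents 0 and 1 are near-duplicates.
docs = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]]
kept = semantic_dedup(docs)
print(kept)
```

This linear scan is quadratic in the number of documents. At corpus scale, a vector database would typically replace the inner loop with an approximate nearest-neighbor lookup.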

Evaluation and Benchmarking · Sepahr-Danesh · RoBERTa · PQuAD · +3 more

4 · arXiv · cs.CL · 22d ago

Sentence-Level Clinical Provenance Categorization for Multidisciplinary Hospital Summarization Using Fine-Tuned Llama-3

This pilot study presents a pipeline for categorizing sentence-level clinical provenance across multi-source hospital notes, targeting structured summarization in high-complexity settings like the NICU. The authors fine-tune Llama-3 8B and 70B models on MedSecId (MIMIC-III annotations), achieving Macro F1 above 92% in-domain. Cross-domain evaluation reveals a scale-dependent transfer effect: SFT substantially improves the 70B model (+7% Macro F1) but yields only marginal gains for the 8B model. A quantized fine-tuned 70B model outperforms its full-precision baseline while reducing compute, suggesting quantized adaptation is viable for structured clinical NLP tasks.

Inference Economics · Enterprise Deployment Patterns · MIMIC-III · Llama 3.1 70B · quantization · +4 more

4 · arXiv · cs.CL · 1mo ago

Manga109-v2026: Revised Benchmark Dataset for Manga OCR and Multimodal Understanding

Researchers revisit the widely-used Manga109 dataset and identify five categories of annotation issues including transcription errors, missing text regions, and under-segmented speech balloons. They construct Manga109-v2026 by combining OCR-based issue detection with manual revision, correcting approximately 29,000 dialogue annotations. The updated dataset is intended to better align with modern OCR and multimodal manga understanding systems while preserving manga-specific expressive structures.

Evaluation and Benchmarking · Multimodal Progress · Manga109-v2026 · Manga109

4 · arXiv · cs.CL · 29d ago

Forgotten Words: Benchmarking NeoBERT for Dementia Detection in Low-Resource Conversational Filipino and English Speech

This paper presents the first NLP-based dementia detection study for Filipino speech, constructing a parallel bilingual dataset of 4,000 DementiaBank-derived transcripts with manual Filipino translations. Five model families are evaluated across monolingual, zero-shot cross-lingual, and bilingual fine-tuning settings. English-trained BERT degrades sharply on Filipino (Macro-F1 = 0.455), but bilingual fine-tuning recovers performance to Macro-F1 = 0.969–0.973 across all transformer models. The key finding is that multilingual clinical NLP performance is driven by linguistic coverage during training rather than model scale or architecture.

Evaluation and Benchmarking · TF-IDF + Logistic Regression · NeoBERT · DementiaBank · +4 more

5 · Meta Llama · 15d ago

Meta releases Llama Prompt Guard 2 (22M) safety classifier on Hugging Face

Meta released Llama Prompt Guard 2-22M, a lightweight 22-million-parameter text classification model for prompt safety, published on Hugging Face under the meta-llama organization. The model is based on the DeBERTa-v2 architecture and is tagged for safety use cases including prompt injection and jailbreak detection. It is part of the Llama 4 safety tooling ecosystem and supports English and French.

Frontier Model Releases · AI Safety Research · Hugging Face · Llama Prompt Guard 2-86M · DeBERTa-v3 · +1 more

4 · arXiv · cs.CL · 1mo ago

Automated ICD Classification of Psychiatric Diagnoses Using NLP and LLMs

This study evaluates NLP and ML approaches for automating the mapping of free-text psychiatric descriptions to ICD diagnostic codes, using a dataset of 145,513 Spanish clinical records. Methods range from classical BoW/TF-IDF representations to transformer-based embeddings including e5_large, BioLORD, and Llama-3-8B. Fine-tuned e5_large achieved the best performance with a micro-F1 of 0.866, outperforming classical methods by capturing semantic nuance and medical terminology. The work highlights challenges of long-tail label distributions and ambiguity specific to psychiatric clinical language.

Enterprise Deployment Patterns · Agent and Tool Ecosystem · International Classification of Diseases (ICD) · e5_large · Bag of Words (BoW) · +3 more

6 · Mistral AI News · 23d ago

Mistral Saba: 24B Regional Language Model for Middle East and South Asia

Mistral AI has released Mistral Saba, a 24B parameter model specialized for Arabic and South Asian languages, with particular strength in South Indian languages such as Tamil. The model is trained on curated datasets from the Middle East and South Asia, and claims to outperform models more than 5x its size on regional tasks while running on single-GPU systems at over 150 tokens/second. It is available via API and for local on-premises deployment, targeting enterprise use cases in conversational support, domain-specific expertise, and cultural content creation. Mistral also announced a custom private model training offering for strategic enterprise customers.

Frontier Model Releases · Inference Economics · Tamil · Mistral AI · Mistral Small 4 · +3 more

7 · Meta Llama · 15d ago

Meta releases Llama 3.2 11B Vision Instruct multimodal model

Meta released Llama 3.2 11B Vision Instruct on Hugging Face, an open-weights multimodal model supporting image-text-to-text tasks. The model is part of the Llama 3.2 family and supports English and German. With over 157K downloads and 1,600 likes, it has seen substantial community adoption.

Open Weights Progress · Multimodal Progress · Hugging Face · Meta Llama 3.2 90B Vision-Instruct