benchmark
FarsTail
benchmark · active · provisional
farstail-eb716de0 · 1 event · first seen 2d ago
Aliases: FarsTail
Recent events (1)
IHUBERT: Persian RoBERTa-base model trained on 45GB semantically deduplicated corpus
Researchers introduce IHUBERT, a 125M-parameter monolingual Persian pretrained language model trained from scratch using the RoBERTa-base architecture on a 45GB curated subset of the Sepahr-Danesh collection (~7-8B tokens). The work features a multi-stage preprocessing pipeline including vector-database-based semantic deduplication for domain-balanced pretraining, and a 139k-vocabulary BPE tokenizer optimized for Persian morphology. IHUBERT is evaluated across seven Persian NLU benchmarks, achieving state-of-the-art results on extractive QA (PQuAD F1 88.35) and NLI (FarsTail Macro-F1 0.835). The paper contributes both a new model and a semantic deduplication methodology applicable to low-resource language pretraining.
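The FarsTail score above is reported as Macro-F1, the unweighted mean of per-class F1 scores. The following is a minimal sketch of that metric, not the paper's evaluation script. It assumes a three-way NLI label set (entailment, contradiction, neutral). The label names and the toy gold and predicted labels are illustrative and are not drawn from FarsTail.

```python
def macro_f1(gold, pred, labels):
    """Return the unweighted mean of per-class F1 scores."""
    scores = []
    for label in labels:
        # Count true positives, false positives, and false negatives for this class.
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); treat an empty class as 0.0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical label set and toy data, for illustration only.
LABELS = ["entailment", "contradiction", "neutral"]
gold = ["entailment", "entailment", "contradiction", "neutral", "neutral", "neutral"]
pred = ["entailment", "neutral", "contradiction", "neutral", "neutral", "entailment"]

score = macro_f1(gold, pred, LABELS)
print(f"Macro-F1: {score:.3f}")
```

Because every class contributes equally regardless of its frequency, Macro-F1 penalizes a model that does well only on the most common label, which is one reason it is often preferred over accuracy for NLI benchmarks.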