benchmark

FarsTail

benchmark · active · provisional · farstail-eb716de0 · 1 event · first seen 2d ago

Aliases: FarsTail

Co-occurring entities

Sepahr-Danesh, RoBERTa, PQuAD, IHUBERT, ParsiNLU-RC

More like this (12)

SlideTailor, TailLoR, Fara1.5, FADA, Open Fable, Fara-7B, FineWeb, FaraGen1.5, MiniF2F, HullFT, FTX, FAST

Recent events (1)

3 · arXiv · cs.CL · 2d ago

IHUBERT: Persian RoBERTa-base model trained on 45GB semantically deduplicated corpus

Researchers introduce IHUBERT, a 125M-parameter monolingual Persian pretrained language model trained from scratch using the RoBERTa-base architecture on a 45GB curated subset of the Sepahr-Danesh collection (~7-8B tokens). The work features a multi-stage preprocessing pipeline including vector-database-based semantic deduplication for domain-balanced pretraining, and a 139k-vocabulary BPE tokenizer optimized for Persian morphology. IHUBERT is evaluated across seven Persian NLU benchmarks, achieving state-of-the-art results on extractive QA (PQuAD F1 88.35) and NLI (FarsTail Macro-F1 0.835). The paper contributes both a new model and a semantic deduplication methodology applicable to low-resource language pretraining.
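For readers unfamiliar with the FarsTail metric quoted above, the following is a minimal sketch of how Macro-F1 is typically computed for a three-class NLI task. It is not the paper's evaluation code. The label strings `e`, `c`, and `n` (entailment, contradiction, neutral) and the function name `macro_f1` are assumptions made for illustration.

```python
from collections import Counter

# Assumed label set for a three-class NLI task (illustrative only):
# "e" = entailment, "c" = contradiction, "n" = neutral.
LABELS = ("e", "c", "n")


def macro_f1(gold, pred, labels=LABELS):
    """Unweighted mean of per-class F1 scores."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[g] += 1  # gold class gains a false negative
    scores = []
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never appears.
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(labels)


gold = ["e", "e", "c", "n"]
pred = ["e", "c", "c", "n"]
print(round(macro_f1(gold, pred), 4))
```

Because every class contributes equally to the average, Macro-F1 penalizes a model that does well on frequent classes but poorly on rare ones, which is why it is often reported alongside or instead of accuracy for NLI.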

Evaluation and Benchmarking, Sepahr-Danesh, RoBERTa, PQuAD