ArabiGEE: Hierarchical taxonomy for Arabic grammatical error explanation with LLM evaluation support
Researchers introduce ArabiGEE, the first structured taxonomy for Arabic grammatical error explanation (GEE), organizing 27 error types, 140 correction types, and 324 explanations across orthographic, morphological, syntactic, and lexical dimensions. The taxonomy is applied to annotate existing Arabic grammatical error correction corpora and is used to benchmark LLMs on Arabic GEE tasks. The work addresses a gap in Arabic NLP tooling by moving beyond free-form explanation generation toward structured, evaluable outputs.
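One way to picture a taxonomy of this shape is as a three-level tree that a model's structured output can be checked against. The sketch below is illustrative only: the class names, the nesting (error type, then correction type, then explanation), the attachment of a dimension to each error type, and all placeholder labels are assumptions, not the ArabiGEE schema or its actual categories.

```python
from dataclasses import dataclass, field

# The four dimensions named in the summary above.
DIMENSIONS = ("orthographic", "morphological", "syntactic", "lexical")


@dataclass
class CorrectionType:
    """Hypothetical middle level: a correction with its allowed explanations."""
    label: str
    explanations: list[str] = field(default_factory=list)


@dataclass
class ErrorType:
    """Hypothetical top level: an error type tied to one dimension."""
    label: str
    dimension: str
    corrections: list[CorrectionType] = field(default_factory=list)

    def __post_init__(self) -> None:
        if self.dimension not in DIMENSIONS:
            raise ValueError(f"unknown dimension: {self.dimension}")


def count_levels(taxonomy: list[ErrorType]) -> tuple[int, int, int]:
    """Return (error types, correction types, explanations) in the tree."""
    n_corrections = sum(len(e.corrections) for e in taxonomy)
    n_explanations = sum(
        len(c.explanations) for e in taxonomy for c in e.corrections
    )
    return len(taxonomy), n_corrections, n_explanations


def is_valid_label(taxonomy: list[ErrorType], error: str,
                   correction: str, explanation: str) -> bool:
    """Check that a predicted triple exists as a path in the tree."""
    for e in taxonomy:
        if e.label != error:
            continue
        for c in e.corrections:
            if c.label == correction and explanation in c.explanations:
                return True
    return False


# Placeholder entries, not real ArabiGEE categories.
toy = [
    ErrorType(
        label="ERROR_A",
        dimension="orthographic",
        corrections=[
            CorrectionType("CORRECTION_A1", ["EXPLANATION_1", "EXPLANATION_2"]),
            CorrectionType("CORRECTION_A2", ["EXPLANATION_3"]),
        ],
    )
]
```

Under these assumptions, `is_valid_label` shows why a structured output is easier to evaluate than free-form text: a predicted triple either is a path in the tree or is not, so it can be scored by exact match.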
Related events (8)
Rethinking LLM Evaluation with 3C3H: AraGen Benchmark and Leaderboard
Hugging Face introduces AraGen, a new Arabic-language LLM benchmark and leaderboard built around the 3C3H evaluation framework (Correctness, Completeness, Conciseness, Helpfulness, Harmlessness, Honesty). The benchmark targets a gap in non-English LLM evaluation, specifically for Arabic, using a structured multi-criteria rubric rather than simple accuracy metrics. The leaderboard is hosted on Hugging Face and aims to provide a more holistic assessment of Arabic generative capabilities across frontier and open-weight models.
Arabic Leaderboards: Introducing Arabic Instruction Following, Updating AraGen, and More
Hugging Face introduces new Arabic-language evaluation infrastructure, including an Arabic Instruction Following benchmark and updates to the AraGen leaderboard. The post covers evaluation methodology for Arabic LLM capabilities, expanding the ecosystem of non-English benchmarks. This is part of a broader effort to track model performance on Arabic language tasks beyond standard English-centric evaluations.
3LM: A Benchmark for Arabic LLMs in STEM and Code
TII UAE has released 3LM, a benchmark designed to evaluate large language models on Arabic-language STEM and coding tasks. The benchmark addresses a gap in multilingual evaluation infrastructure, where Arabic has been underrepresented relative to English and other high-resource languages. It targets both general-purpose and Arabic-specialized LLMs to assess their capabilities in technical domains.
The Open Arabic LLM Leaderboard 2
Hugging Face has launched the second version of the Open Arabic LLM Leaderboard, a benchmarking platform for evaluating large language models on Arabic language tasks. The updated leaderboard introduces revised evaluation protocols and benchmarks targeting Arabic-specific capabilities. This initiative supports the open research community in tracking progress on Arabic NLP, since Arabic has historically been underserved by LLM evaluation infrastructure.
Introducing the Open Arabic LLM Leaderboard
Hugging Face has launched the Open Arabic LLM Leaderboard, a benchmarking platform specifically designed to evaluate large language models on Arabic language tasks. The leaderboard aims to fill a gap in multilingual evaluation infrastructure by providing standardized assessments for Arabic NLP capabilities. This initiative supports the open-source community in tracking progress on Arabic language understanding and generation.
Study: LLM-Derived Error Highlights and APE Suggestions in MT Post-Editing
Researchers conducted a controlled study with professional English-to-Dutch (En-Nl) translators comparing post-editing (PE) workflows augmented with LLM-derived error highlights and automatic post-editing (APE) correction suggestions against regular PE and QE-derived highlights. No condition produced measurable productivity or quality gains over standard PE. However, APE-derived highlights were preferred over QE-derived highlights, and correction suggestions improved subjective user experience.
Alyah: Benchmark for Evaluating Emirati Dialect Capabilities in Arabic LLMs
TII UAE introduces Alyah, a benchmark designed to evaluate large language models on Emirati Arabic dialect understanding and generation. The work addresses a gap in Arabic NLP evaluation, where most benchmarks focus on Modern Standard Arabic and neglect regional dialects. The benchmark aims to provide robust assessment of LLM capabilities specific to Emirati linguistic and cultural context.
QIMMA: A Quality-First Arabic LLM Leaderboard
TII UAE (Technology Innovation Institute) has launched QIMMA, a leaderboard specifically designed to evaluate large language models on Arabic language tasks with a focus on quality-first assessment. The leaderboard aims to address gaps in Arabic NLP evaluation by providing standardized benchmarks tailored to Arabic linguistic characteristics. This represents a dedicated infrastructure effort for tracking LLM progress on Arabic, a language historically underserved by evaluation frameworks.
