4 · Hugging Face Blog · 1mo ago

Arabic Leaderboards: Introducing Arabic Instruction Following, Updating AraGen, and More

Hugging Face introduces new Arabic-language evaluation infrastructure, including an Arabic Instruction Following benchmark and updates to the AraGen leaderboard. The post covers the evaluation methodology used to measure Arabic LLM capabilities and expands the ecosystem of non-English benchmarks. It is part of a broader effort to track model performance on Arabic-language tasks beyond standard English-centric evaluations.

Evaluation and Benchmarking · Open Weights Progress · AraGen · Hugging Face · Open Arabic LLM Leaderboard · Arabic Instruction Following Eval (IFEval)

Related guides (3)

Hugging Face

Hugging Face: The Home of Open-Source AI

Open Weights Progress · Topic guide

Open Weights Progress: How Freely Available AI Models Caught Up to the Frontier

Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Related events (8)

4 · Hugging Face Blog · 1mo ago

The Open Arabic LLM Leaderboard 2

Hugging Face has launched the second version of the Open Arabic LLM Leaderboard, a benchmarking platform for evaluating large language models on Arabic-language tasks. The updated leaderboard introduces revised evaluation protocols and benchmarks targeting Arabic-specific capabilities. The initiative helps the open research community track progress in NLP for Arabic, a language historically underserved by LLM evaluation infrastructure.

Evaluation and Benchmarking · Open Weights Progress · Hugging Face · Open Arabic LLM Leaderboard

4 · Hugging Face Blog · 1mo ago

Introducing the Open Arabic LLM Leaderboard

Hugging Face has launched the Open Arabic LLM Leaderboard, a benchmarking platform designed to evaluate large language models on Arabic-language tasks. The leaderboard aims to fill a gap in multilingual evaluation infrastructure by providing standardized assessments of Arabic NLP capabilities. The initiative helps the open-source community track progress on Arabic language understanding and generation.

Evaluation and Benchmarking · Open Weights Progress · Hugging Face · Open Arabic LLM Leaderboard

4 · Hugging Face Blog · 1mo ago

Rethinking LLM Evaluation with 3C3H: AraGen Benchmark and Leaderboard

Hugging Face introduces AraGen, a new Arabic-language LLM benchmark and leaderboard built around the 3C3H evaluation framework (Correctness, Completeness, Conciseness, Helpfulness, Harmlessness, Honesty). The benchmark targets a gap in non-English LLM evaluation, specifically for Arabic, using a structured multi-criteria rubric rather than simple accuracy metrics. The leaderboard is hosted on Hugging Face and aims to provide a more holistic assessment of Arabic generative capabilities across frontier and open-weight models.

Frontier Model Releases · Evaluation and Benchmarking · 3C3H · AraGen · Hugging Face

4 · Hugging Face Blog · 1mo ago

Introducing the Open Leaderboard for Hebrew LLMs

Hugging Face has launched an open leaderboard dedicated to evaluating large language models on Hebrew-language tasks. The leaderboard benchmarks multilingual and Hebrew-specific models on standardized tasks to track progress in Hebrew NLP, filling a gap in non-English evaluation infrastructure.

Evaluation and Benchmarking · Open Weights Progress · Hugging Face · Open Hebrew LLM Leaderboard

4 · Hugging Face Blog · 1mo ago

Open ASR Leaderboard: Trends and Insights with New Multilingual & Long-Form Tracks

Hugging Face has updated its Open ASR Leaderboard to include new multilingual and long-form audio transcription evaluation tracks. The post analyzes trends across submitted automatic speech recognition models, providing comparative benchmarking data across languages and extended audio contexts. This expands the leaderboard's coverage beyond English short-form ASR to better reflect real-world deployment scenarios.

Evaluation and Benchmarking · Multimodal Progress · Open ASR Leaderboard · Automatic Speech Recognition · Hugging Face

5 · Hugging Face Blog · 1mo ago

QIMMA: A Quality-First Arabic LLM Leaderboard

TII UAE (Technology Innovation Institute) has launched QIMMA, a leaderboard designed to evaluate large language models on Arabic-language tasks with a quality-first approach to assessment. The leaderboard aims to address gaps in Arabic NLP evaluation by providing standardized benchmarks tailored to Arabic linguistic characteristics. It is a dedicated infrastructure effort for tracking LLM progress in Arabic, a language historically underserved by evaluation frameworks.

Evaluation and Benchmarking · Open Weights Progress · Hugging Face · QIMMA · Technology Innovation Institute

4 · Hugging Face Blog · 1mo ago

3LM: A Benchmark for Arabic LLMs in STEM and Code

TII UAE has released 3LM, a benchmark designed to evaluate large language models on Arabic-language STEM and coding tasks. The benchmark addresses a gap in multilingual evaluation infrastructure, where Arabic has been underrepresented relative to English and other high-resource languages. It targets both general-purpose and Arabic-specialized LLMs to assess their capabilities in technical domains.

Evaluation and Benchmarking · Open Weights Progress · 3LM · Hugging Face · TII UAE

4 · Hugging Face Blog · 1mo ago

Alyah: Benchmark for Evaluating Emirati Dialect Capabilities in Arabic LLMs

TII UAE introduces Alyah, a benchmark designed to evaluate large language models on understanding and generating the Emirati Arabic dialect. The work addresses a gap in Arabic NLP evaluation, where most benchmarks focus on Modern Standard Arabic and neglect regional dialects. The benchmark aims to provide a robust assessment of LLM capabilities specific to the Emirati linguistic and cultural context.

Frontier Model Releases · Evaluation and Benchmarking · Emirati Arabic · Arabic LLMs · Technology Innovation Institute