4·Hugging Face Blog·1mo ago

Rethinking LLM Evaluation with 3C3H: AraGen Benchmark and Leaderboard

A Hugging Face blog post introduces AraGen, an Arabic-language LLM benchmark and leaderboard built around the 3C3H evaluation framework, which scores responses on Correctness, Completeness, Conciseness, Helpfulness, Harmlessness, and Honesty. The benchmark targets the gap in non-English LLM evaluation, specifically for Arabic, by using a structured multi-criteria rubric rather than a single accuracy metric. The leaderboard is hosted on Hugging Face and aims to assess Arabic generative capabilities more holistically across frontier and open-weight models.
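To make the idea of a multi-criteria rubric concrete, here is a minimal Python sketch that combines six per-criterion scores into one number. The equal weighting, the [0, 1] score range, and the function name `score_3c3h` are assumptions made for illustration; the summary above does not describe AraGen's actual scoring rules, which may differ.

```python
from typing import Mapping

# The six 3C3H criteria named in the summary.
CRITERIA = (
    "correctness",
    "completeness",
    "conciseness",
    "helpfulness",
    "harmlessness",
    "honesty",
)


def score_3c3h(scores: Mapping[str, float]) -> float:
    """Average six per-criterion scores into a single value.

    Assumption: every criterion is already normalized to [0, 1] and all
    criteria carry equal weight. This is a hypothetical aggregation, not
    the benchmark's documented method.
    """
    missing = [c for c in CRITERIA if c not in scores]
    if missing:
        raise ValueError(f"missing criteria: {missing}")
    for name in CRITERIA:
        if not 0.0 <= scores[name] <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {scores[name]}")
    return sum(scores[name] for name in CRITERIA) / len(CRITERIA)


if __name__ == "__main__":
    # A correct but only partially complete, verbose answer.
    example = {
        "correctness": 1.0,
        "completeness": 0.5,
        "conciseness": 0.5,
        "helpfulness": 0.5,
        "harmlessness": 1.0,
        "honesty": 1.0,
    }
    print(score_3c3h(example))
```

Unlike a plain accuracy metric, a rubric of this shape lets two answers that are both correct receive different scores when one is more complete or more concise.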

Frontier Model Releases · Evaluation and Benchmarking · 3C3H · AraGen · Hugging Face

Related guides (3)

Hugging Face

Hugging Face: The Home of Open-Source AI

Frontier Model Releases · Topic guide

Frontier Model Releases: The Race From Language to Action

Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Related events (8)

4·Hugging Face Blog·1mo ago

Arabic Leaderboards: Introducing Arabic Instruction Following, Updating AraGen, and More

Hugging Face introduces new Arabic-language evaluation infrastructure, including an Arabic Instruction Following benchmark and updates to the AraGen leaderboard. The post covers evaluation methodology for Arabic LLM capabilities, expanding the ecosystem of non-English benchmarks. This is part of a broader effort to track model performance on Arabic language tasks beyond standard English-centric evaluations.

Evaluation and Benchmarking · Open Weights Progress · AraGen · Hugging Face · Open Arabic LLM Leaderboard · +1 more

4·Hugging Face Blog·1mo ago

Introducing the Open Arabic LLM Leaderboard

Hugging Face has launched the Open Arabic LLM Leaderboard, a benchmarking platform specifically designed to evaluate large language models on Arabic language tasks. The leaderboard aims to fill a gap in multilingual evaluation infrastructure by providing standardized assessments for Arabic NLP capabilities. This initiative supports the open-source community in tracking progress on Arabic language understanding and generation.

Evaluation and Benchmarking · Open Weights Progress · Hugging Face · Open Arabic LLM Leaderboard

4·Hugging Face Blog·1mo ago

The Open Arabic LLM Leaderboard 2

Hugging Face has launched the second version of the Open Arabic LLM Leaderboard, a benchmarking platform for evaluating large language models on Arabic language tasks. The updated leaderboard introduces revised evaluation protocols and benchmarks targeting Arabic-specific capabilities. This initiative supports the open research community in tracking progress on NLP for Arabic, a language historically underserved by LLM evaluation infrastructure.

Evaluation and Benchmarking · Open Weights Progress · Hugging Face · Open Arabic LLM Leaderboard

4·Hugging Face Blog·1mo ago

3LM: A Benchmark for Arabic LLMs in STEM and Code

TII UAE has released 3LM, a benchmark designed to evaluate large language models on Arabic-language STEM and coding tasks. The benchmark addresses a gap in multilingual evaluation infrastructure, where Arabic has been underrepresented relative to English and other high-resource languages. It targets both general-purpose and Arabic-specialized LLMs to assess their capabilities in technical domains.

Evaluation and Benchmarking · Open Weights Progress · 3LM · Hugging Face · TII UAE

5·Hugging Face Blog·1mo ago

Judge Arena: Benchmarking LLMs as Evaluators

Hugging Face and Atla have launched Judge Arena, a platform for benchmarking large language models in their role as automated evaluators. The initiative uses an Elo-based ranking system to compare how well different LLMs judge the quality of model outputs, addressing the growing reliance on LLM-as-judge paradigms in evaluation pipelines. This fills a meta-evaluation gap: as LLM judges become standard practice, understanding their relative reliability and biases becomes critical infrastructure for the field.
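As a rough illustration of how an Elo-based ranking turns pairwise comparisons into ratings, the sketch below implements the standard Elo update. The K-factor of 32, the starting rating of 1500, and the function names are assumptions chosen for the example; the summary above does not specify Judge Arena's actual parameters or implementation.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(
    rating_a: float,
    rating_b: float,
    score_a: float,
    k: float = 32.0,  # assumed K-factor, not taken from Judge Arena
) -> tuple[float, float]:
    """Return updated ratings after one comparison.

    score_a is 1.0 if judge A was preferred, 0.0 if judge B was preferred,
    and 0.5 for a tie.
    """
    exp_a = expected_score(rating_a, rating_b)
    exp_b = 1.0 - exp_a
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - exp_b)
    return new_a, new_b


if __name__ == "__main__":
    # Two hypothetical judge models start at the same rating; A wins one vote.
    judge_a, judge_b = update_elo(1500.0, 1500.0, score_a=1.0)
    print(judge_a, judge_b)
```

Because the winner gains exactly what the loser gives up, the total rating pool stays constant, and an upset win against a higher-rated judge moves ratings more than an expected win does.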

Evaluation and Benchmarking · Agent and Tool Ecosystem · LLM-as-a-Judge · Judge Arena · Hugging Face · +2 more

5·Hugging Face Blog·1mo ago

QIMMA: A Quality-First Arabic LLM Leaderboard

TII UAE (Technology Innovation Institute) has launched QIMMA, a leaderboard for evaluating large language models on Arabic language tasks with a quality-first approach to assessment. The leaderboard aims to address gaps in Arabic NLP evaluation by providing standardized benchmarks tailored to Arabic linguistic characteristics. It is a dedicated infrastructure effort for tracking LLM progress in Arabic, a language historically underserved by evaluation frameworks.

Evaluation and Benchmarking · Open Weights Progress · Hugging Face · QIMMA · Technology Innovation Institute

4·Hugging Face Blog·1mo ago

Introducing the Open Leaderboard for Hebrew LLMs

Hugging Face has launched an open leaderboard dedicated to evaluating large language models on Hebrew language tasks. The leaderboard aims to benchmark multilingual and Hebrew-specific models across standardized tasks to track progress in Hebrew NLP. This fills a gap in non-English language evaluation infrastructure.

Evaluation and Benchmarking · Open Weights Progress · Hugging Face · Open Hebrew LLM Leaderboard

4·Hugging Face Blog·1mo ago

Introducing the Open Leaderboard for Japanese LLMs

Hugging Face has launched an open leaderboard specifically for evaluating large language models on Japanese language tasks. The leaderboard aims to provide standardized benchmarking for Japanese LLMs, filling a gap in multilingual evaluation infrastructure. This initiative supports the growing ecosystem of Japanese-language AI development and open evaluation practices.

Evaluation and Benchmarking · Open Weights Progress · Open Leaderboard for Japanese LLMs · Hugging Face