Call Me Almanac

4 · Hugging Face Blog · 1mo ago

3LM: A Benchmark for Arabic LLMs in STEM and Code

TII UAE has released 3LM, a benchmark designed to evaluate large language models on Arabic-language STEM and coding tasks. The benchmark addresses a gap in multilingual evaluation infrastructure, where Arabic has been underrepresented relative to English and other high-resource languages. It targets both general-purpose and Arabic-specialized LLMs to assess their capabilities in technical domains.

Evaluation and Benchmarking · Open Weights Progress · 3LM · Hugging Face · TII UAE

Related guides (3)

Hugging Face

Hugging Face: The Home of Open-Source AI

Open Weights Progress · Topic guide

Open Weights Progress: How Freely Available AI Models Caught Up to the Frontier

Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Related events (8)

5 · Hugging Face Blog · 1mo ago

QIMMA: A Quality-First Arabic LLM Leaderboard

TII UAE (Technology Innovation Institute) has launched QIMMA, a leaderboard specifically designed to evaluate large language models on Arabic language tasks with a focus on quality-first assessment. The leaderboard aims to address gaps in Arabic NLP evaluation by providing standardized benchmarks tailored to Arabic linguistic characteristics. This represents a dedicated infrastructure effort for tracking LLM progress in Arabic, a language historically underserved by evaluation frameworks.

Evaluation and Benchmarking · Open Weights Progress · Hugging Face · QIMMA · Technology Innovation Institute

4 · Hugging Face Blog · 1mo ago

The Open Arabic LLM Leaderboard 2

Hugging Face has launched the second version of the Open Arabic LLM Leaderboard, a benchmarking platform for evaluating large language models on Arabic language tasks. The updated leaderboard introduces revised evaluation protocols and benchmarks targeting Arabic-specific capabilities. This initiative supports the open research community in tracking NLP progress for Arabic, a language historically underserved by LLM evaluation infrastructure.

Evaluation and Benchmarking · Open Weights Progress · Hugging Face · Open Arabic LLM Leaderboard

4 · Hugging Face Blog · 1mo ago

Introducing the Open Arabic LLM Leaderboard

Hugging Face has launched the Open Arabic LLM Leaderboard, a benchmarking platform specifically designed to evaluate large language models on Arabic language tasks. The leaderboard aims to fill a gap in multilingual evaluation infrastructure by providing standardized assessments for Arabic NLP capabilities. This initiative supports the open-source community in tracking progress on Arabic language understanding and generation.

Evaluation and Benchmarking · Open Weights Progress · Hugging Face · Open Arabic LLM Leaderboard

4 · Hugging Face Blog · 1mo ago

Rethinking LLM Evaluation with 3C3H: AraGen Benchmark and Leaderboard

Hugging Face introduces AraGen, a new Arabic-language LLM benchmark and leaderboard built around the 3C3H evaluation framework (Correctness, Completeness, Conciseness, Helpfulness, Harmlessness, Honesty). The benchmark targets a gap in non-English LLM evaluation, specifically for Arabic, using a structured multi-criteria rubric rather than simple accuracy metrics. The leaderboard is hosted on Hugging Face and aims to provide a more holistic assessment of Arabic generative capabilities across frontier and open-weight models.

Frontier Model Releases · Evaluation and Benchmarking · 3C3H · AraGen · Hugging Face

4 · Hugging Face Blog · 1mo ago

Alyah: Benchmark for Evaluating Emirati Dialect Capabilities in Arabic LLMs

TII UAE introduces Alyah, a benchmark designed to evaluate large language models on Emirati Arabic dialect understanding and generation. The work addresses a gap in Arabic NLP evaluation, where most benchmarks focus on Modern Standard Arabic and neglect regional dialects. The benchmark aims to provide robust assessment of LLM capabilities specific to Emirati linguistic and cultural context.

Frontier Model Releases · Evaluation and Benchmarking · Emirati Arabic · Arabic LLMs · Technology Innovation Institute · +1 more

4 · Hugging Face Blog · 1mo ago

FilBench: Benchmarking LLM Capabilities in Filipino Language

FilBench is a new benchmark introduced to evaluate large language models on their ability to understand and generate Filipino. The benchmark targets a historically underrepresented language in NLP evaluation suites, assessing both comprehension and generation tasks. This work addresses gaps in multilingual LLM evaluation coverage, particularly for Southeast Asian languages.

Evaluation and Benchmarking · Multimodal Progress · FilBench · Hugging Face · Filipino

4 · Hugging Face Blog · 1mo ago

Arabic Leaderboards: Introducing Arabic Instruction Following, Updating AraGen, and More

Hugging Face introduces new Arabic-language evaluation infrastructure, including an Arabic Instruction Following benchmark and updates to the AraGen leaderboard. The post covers evaluation methodology for Arabic LLM capabilities, expanding the ecosystem of non-English benchmarks. This is part of a broader effort to track model performance on Arabic language tasks beyond standard English-centric evaluations.

Evaluation and Benchmarking · Open Weights Progress · AraGen · Hugging Face · Open Arabic LLM Leaderboard · +1 more

5 · Hugging Face Blog · 1mo ago

Judge Arena: Benchmarking LLMs as Evaluators

Hugging Face and Atla have launched Judge Arena, a platform for benchmarking large language models in their role as automated evaluators. The initiative uses an Elo-based ranking system to compare how well different LLMs judge the quality of model outputs, addressing the growing reliance on LLM-as-judge paradigms in evaluation pipelines. This fills a meta-evaluation gap: as LLM judges become standard practice, understanding their relative reliability and biases becomes critical infrastructure for the field.

Evaluation and Benchmarking · Agent and Tool Ecosystem · LLM-as-a-Judge · Judge Arena · Hugging Face · +2 more
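
For readers unfamiliar with the Elo-based ranking mentioned in the Judge Arena summary, here is a minimal sketch of the standard Elo update applied to a head-to-head comparison between two LLM judges. The summary does not describe Judge Arena's actual parameters or implementation, so the K-factor of 32, the 1500 starting rating, and the function names are assumptions chosen for illustration only.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A is preferred over B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return updated (rating_a, rating_b) after one comparison.

    score_a is 1.0 if judge A was preferred, 0.0 if judge B was, 0.5 for a tie.
    k=32 is an assumed K-factor, not a value taken from Judge Arena.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    # Whatever A gains, B loses, so the total rating pool stays constant.
    return rating_a + delta, rating_b - delta


# Hypothetical example: two judges start at an assumed 1500 rating,
# and a voter prefers judge A's evaluation.
judge_a, judge_b = elo_update(1500.0, 1500.0, score_a=1.0)
print(judge_a, judge_b)
```

With equal starting ratings the expected score is 0.5, so a single win moves each judge by half the K-factor in opposite directions. An upset win against a higher-rated judge moves the ratings by more, which is what lets a ranking converge from pairwise votes.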