4 · Hugging Face Blog · 1mo ago

BenCzechMark: A Benchmark for Evaluating LLM Czech Language Understanding

BenCzechMark is a new evaluation benchmark designed to assess large language model performance on Czech language tasks. The benchmark addresses the gap in non-English language evaluation, providing a structured way to measure LLM capabilities in Czech across multiple task types. Published on Hugging Face, it contributes to the growing ecosystem of multilingual and language-specific benchmarks.

Evaluation and Benchmarking Hugging Face BenCzechMark

Related guides (2)

Hugging Face

Hugging Face: The Home of Open-Source AI


Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder


Related events (8)

4 · Hugging Face Blog · 1mo ago

FilBench: Benchmarking LLM Capabilities in Filipino Language

FilBench is a new benchmark introduced to evaluate large language models on their ability to understand and generate Filipino. The benchmark targets a historically underrepresented language in NLP evaluation suites, assessing both comprehension and generation tasks. This work addresses gaps in multilingual LLM evaluation coverage, particularly for Southeast Asian languages.

Evaluation and Benchmarking Multimodal Progress FilBench Hugging Face Filipino

4 · Hugging Face Blog · 1mo ago

Rethinking LLM Evaluation with 3C3H: AraGen Benchmark and Leaderboard

A Hugging Face Blog post introduces AraGen, a new Arabic-language LLM benchmark and leaderboard built around the 3C3H evaluation framework (Correctness, Completeness, Conciseness, Helpfulness, Harmlessness, Honesty). The benchmark targets a gap in non-English LLM evaluation, specifically for Arabic, and scores responses against a structured multi-criteria rubric rather than a single accuracy metric. The leaderboard is hosted on Hugging Face and aims to provide a more holistic assessment of Arabic generative capabilities across frontier and open-weight models.

Frontier Model Releases Evaluation and Benchmarking 3C3H AraGen Hugging Face

5 · Hugging Face Blog · 1mo ago

Judge Arena: Benchmarking LLMs as Evaluators

Hugging Face and Atla have launched Judge Arena, a platform for benchmarking large language models in their role as automated evaluators. The initiative uses an Elo-based ranking system to compare how well different LLMs judge the quality of model outputs, addressing the growing reliance on LLM-as-judge paradigms in evaluation pipelines. This fills a meta-evaluation gap: as LLM judges become standard practice, the field needs a way to compare their reliability and biases.

Evaluation and Benchmarking Agent and Tool Ecosystem LLM-as-a-Judge Judge Arena Hugging Face +2 more
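
To make the "Elo-based ranking" in the Judge Arena summary concrete, here is a minimal sketch of the textbook Elo update applied to pairwise votes between two judge models. The K-factor of 32, the starting rating of 1000, and the judge names are illustrative assumptions. The summary does not say which Elo variant or parameters Judge Arena uses.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return updated (rating_a, rating_b) after one comparison.

    score_a is 1.0 if A's judgment was preferred, 0.0 if B's was, 0.5 for a tie.
    k = 32 is an assumed K-factor, not a value taken from Judge Arena.
    """
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b


# Hypothetical judges, both starting at an assumed rating of 1000.
ratings = {"judge_x": 1000.0, "judge_y": 1000.0}

# One human vote prefers judge_x's evaluation over judge_y's.
ratings["judge_x"], ratings["judge_y"] = update_elo(
    ratings["judge_x"], ratings["judge_y"], score_a=1.0
)
print(ratings)  # {'judge_x': 1016.0, 'judge_y': 984.0}
```

Because each update adds to one rating exactly what it subtracts from the other, the total rating pool stays constant. An upset win over a higher-rated judge moves the ratings more than an expected win does.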

4 · Hugging Face Blog · 1mo ago

Introducing the Open Leaderboard for Hebrew LLMs

A new open leaderboard, announced on the Hugging Face Blog, is dedicated to evaluating large language models on Hebrew language tasks. The leaderboard aims to benchmark multilingual and Hebrew-specific models across standardized tasks to track progress in Hebrew NLP. This fills a gap in non-English language evaluation infrastructure.

Evaluation and Benchmarking Open Weights Progress Hugging Face Open Hebrew LLM Leaderboard

3 · arXiv · cs.LG · 8d ago

SkMTEB: First comprehensive MTEB-style text embedding benchmark for Slovak with adapted E5 models

Researchers introduce SkMTEB, the first MTEB-style embedding benchmark for Slovak, covering 31 datasets across 7 task types, roughly 4× the existing multilingual benchmark coverage for the language. Evaluation of 31 embedding models shows that large instruction-tuned multilingual models outperform Slovak-specific NLU models on embedding tasks. The authors also release e5-sk-small (45M) and e5-sk-large (365M), derived from Multilingual E5 via vocabulary trimming and fine-tuning; these models achieve performance competitive with proprietary APIs while reducing model size by up to 62%.

Evaluation and Benchmarking Open Weights Progress MTEB SkMTEB e5_large +2 more
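
The SkMTEB summary mentions "vocabulary trimming" without describing the procedure. As a rough sketch of the general idea only, the code below keeps the token ids that occur in a target-language corpus (plus special tokens) and drops the remaining rows from the embedding table. The function name, the toy vocabulary, and the keep-if-seen rule are all assumptions for illustration, not the authors' method.

```python
def trim_vocabulary(vocab, embeddings, corpus_token_ids, always_keep=()):
    """Drop embedding rows for tokens never seen in the target corpus.

    vocab: dict mapping token string -> old id
    embeddings: list of rows, indexed by old id
    corpus_token_ids: iterable of old ids observed in the target-language corpus
    always_keep: token strings to retain regardless (e.g. special tokens)
    """
    used = set(corpus_token_ids) | {vocab[tok] for tok in always_keep}
    kept_old_ids = sorted(used)
    # Re-number the surviving tokens contiguously from 0.
    old_to_new = {old: new for new, old in enumerate(kept_old_ids)}
    new_vocab = {tok: old_to_new[i] for tok, i in vocab.items() if i in old_to_new}
    new_embeddings = [embeddings[i] for i in kept_old_ids]
    return new_vocab, new_embeddings, old_to_new


# Toy multilingual vocabulary (hypothetical) with 2-dimensional embeddings.
vocab = {"<pad>": 0, "<unk>": 1, "dobrý": 2, "deň": 3, "hello": 4, "world": 5}
embeddings = [[float(i), float(i)] for i in range(len(vocab))]

# Token ids observed when tokenizing a (tiny) Slovak corpus.
slovak_ids = [2, 3, 3, 2]

new_vocab, new_embeddings, old_to_new = trim_vocabulary(
    vocab, embeddings, slovak_ids, always_keep=("<pad>", "<unk>")
)
print(new_vocab)            # {'<pad>': 0, '<unk>': 1, 'dobrý': 2, 'deň': 3}
print(len(new_embeddings))  # 4
```

In multilingual encoders the embedding table often accounts for a large share of the parameters, so removing unused rows can shrink a model substantially without touching the transformer layers. The fine-tuning step mentioned in the summary is not covered by this sketch.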

4 · arXiv · cs.CL · 12d ago

Phun-Bench: A Chinese benchmark for evaluating LLM phonological understanding

Researchers introduce Phun-Bench, a purpose-built benchmark for evaluating LLMs on phonological understanding in Chinese across three dimensions: Homophony, Rhyme, and Phonetic Similarity. The benchmark is designed to avoid rote-memorization shortcuts that plague existing phonological evals. Results show LLMs can recall correct pronunciations but fail to apply phonological knowledge flexibly as human speakers do, and the authors propose a hypothesis about the underlying mechanism of LLM phonological 'perception'.

Evaluation and Benchmarking Phun-Bench

5 · arXiv · cs.CL · 23d ago

Towards Reliable Multilingual LLMs-as-a-Judge: An Empirical Study

This paper systematically investigates strategies for extending LLM-based automatic evaluation (LLMs-as-a-Judge) to multilingual settings, covering high-, mid-, and low-resource languages (English, Spanish, Basque). The authors compare instruction translation, monolingual vs. multilingual supervision, and model size, finding that fine-tuned smaller models can match proprietary models when in-domain data is available, while zero-shot larger models are preferable out-of-domain. Two meta-evaluation datasets are extended to Spanish and Basque, and all data and code are publicly released.

Evaluation and Benchmarking Basque language LLM-as-a-Judge mJudge +2 more

4 · Hugging Face Blog · 1mo ago

Introducing the Open Ko-LLM Leaderboard: Leading the Korean LLM Evaluation Ecosystem

Upstage and Hugging Face have launched the Open Ko-LLM Leaderboard, a public benchmark platform for evaluating large language models specifically on Korean language tasks. The leaderboard aims to standardize Korean LLM evaluation and foster competition among models targeting the Korean-language market. This initiative extends the Open LLM Leaderboard framework to a non-English language context, reflecting growing interest in multilingual and language-specific model evaluation.

Evaluation and Benchmarking Open Weights Progress Open LLM Leaderboard Hugging Face Upstage