3 · Hugging Face Blog · 1mo ago

Guide to Setting Up a Hugging Face Leaderboard: Vectara Hallucination Leaderboard as Example

This Hugging Face blog post provides an end-to-end tutorial on creating custom leaderboards on the Hugging Face platform, using Vectara's hallucination leaderboard as a concrete example. It covers the technical setup for hosting evaluation leaderboards, which are increasingly important infrastructure for tracking model capabilities. The post connects tooling and evaluation concerns by showing how third-party organizations can publish standardized benchmarks on Hugging Face.

Evaluation and Benchmarking · Agent and Tool Ecosystem · Vectara · Hugging Face · Hugging Face Leaderboard · Vectara Hallucination Leaderboard

Related guides (3)

Hugging Face

Hugging Face: The Home of Open-Source AI

Agent and Tool Ecosystem · Topic guide

Agent and Tool Ecosystem: How AI Is Learning to Act, Not Just Answer

Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Related events (8)

5 · Hugging Face Blog · 1mo ago

The Hallucinations Leaderboard, an Open Effort to Measure Hallucinations in Large Language Models

Hugging Face has launched an open leaderboard specifically designed to benchmark hallucination rates across large language models. The effort aims to standardize evaluation of factual accuracy and confabulation tendencies, filling a gap in existing benchmarks that focus primarily on capability rather than reliability. The leaderboard is positioned as a community-driven, transparent resource for tracking model trustworthiness.

Evaluation and Benchmarking · AI Safety Research · Hugging Face · Hallucinations Leaderboard

4 · Hugging Face Blog · 1mo ago

Object Detection Leaderboard on Hugging Face

Hugging Face has launched an object detection leaderboard to benchmark and compare models on standard detection tasks. The leaderboard provides a centralized evaluation platform for tracking progress in object detection across the community. This follows the pattern of Hugging Face expanding its evaluation infrastructure for specific ML subdomains.

Evaluation and Benchmarking · Hugging Face · Object Detection Leaderboard

4 · Hugging Face Blog · 1mo ago

Bringing the Artificial Analysis LLM Performance Leaderboard to Hugging Face

Hugging Face is hosting the Artificial Analysis LLM Performance Leaderboard, which tracks inference performance metrics such as latency, throughput, and cost across multiple LLM providers. The leaderboard offers a standardized comparison of how models perform in production deployments, rather than measuring capability alone. This collaboration brings infrastructure and deployment performance data into the Hugging Face ecosystem.

Evaluation and Benchmarking · Inference Economics · Artificial Analysis · Hugging Face · Artificial Analysis LLM Performance Leaderboard

5 · Hugging Face Blog · 1mo ago

Community Evals: Because we're done trusting black-box leaderboards over the community

Hugging Face introduces Community Evals, a framework aimed at replacing or supplementing opaque black-box leaderboards with community-driven model evaluations. The initiative reflects growing skepticism about the reliability and transparency of existing benchmark leaderboards. By crowdsourcing evaluations, Hugging Face seeks to make model assessment more transparent, diverse, and resistant to gaming. This represents a structural shift in how the open-source AI community approaches model comparison and trust.

Evaluation and Benchmarking · Open Weights Progress · Open LLM Leaderboard · Hugging Face · Community Evals

5 · Hugging Face Blog · 1mo ago

Launching the Artificial Analysis Text to Image Leaderboard & Arena

Hugging Face and Artificial Analysis are launching a combined leaderboard and arena for evaluating text-to-image models. The leaderboard tracks quality, speed, and cost metrics across leading image generation models, while the arena component collects human preference votes for side-by-side comparisons. This provides a structured benchmark for comparing commercial and open-weight image generation systems.

Evaluation and Benchmarking · Inference Economics · Artificial Analysis · Artificial Analysis Text to Image Leaderboard · Hugging Face

5 · Hugging Face Blog · 1mo ago

Introducing the LiveCodeBench Leaderboard - Holistic and Contamination-Free Evaluation of Code LLMs

Hugging Face introduces a leaderboard based on LiveCodeBench, a benchmark designed for holistic and contamination-free evaluation of code-generating large language models. The benchmark continuously collects new coding problems from competitive programming platforms to prevent data contamination that plagues static benchmarks. It evaluates models across multiple code-related tasks beyond just code generation, aiming to provide a more reliable signal of true model capability.

Evaluation and Benchmarking · Agent and Tool Ecosystem · LiveCodeBench · Hugging Face · LiveCodeBench Leaderboard

4 · Hugging Face Blog · 1mo ago

Introducing the Open Leaderboard for Hebrew LLMs

Hugging Face has launched an open leaderboard dedicated to evaluating large language models on Hebrew language tasks. The leaderboard aims to benchmark multilingual and Hebrew-specific models across standardized tasks to track progress in Hebrew NLP. This fills a gap in non-English language evaluation infrastructure.

Evaluation and Benchmarking · Open Weights Progress · Hugging Face · Open Hebrew LLM Leaderboard

4 · Hugging Face Blog · 1mo ago

Adding Benchmaxxer Repellant to the Open ASR Leaderboard

Hugging Face describes measures taken to prevent benchmark gaming ('benchmaxxing') on the Open ASR Leaderboard by introducing private or held-out evaluation data. The post addresses the integrity of automatic speech recognition benchmarks, where models may be overfitted or tuned specifically to public test sets. This is part of a broader effort to maintain meaningful leaderboard rankings as ASR model submissions increase.

Evaluation and Benchmarking · Open ASR Leaderboard · benchmaxxing · Hugging Face