Guide to Setting Up a Hugging Face Leaderboard: Vectara Hallucination Leaderboard as Example
This Hugging Face blog post provides an end-to-end tutorial on creating custom leaderboards on the Hugging Face platform, using Vectara's hallucination leaderboard as a concrete example. It covers the technical setup process for hosting evaluation leaderboards, which are increasingly important infrastructure for tracking model capabilities. The post connects tooling and evaluation by showing how third-party organizations can publish standardized benchmarks on Hugging Face.
Related events (8)
The Hallucinations Leaderboard, an Open Effort to Measure Hallucinations in Large Language Models
Hugging Face has launched an open leaderboard specifically designed to benchmark hallucination rates across large language models. The effort aims to standardize evaluation of factual accuracy and confabulation tendencies, filling a gap in existing benchmarks that focus primarily on capability rather than reliability. The leaderboard is positioned as a community-driven, transparent resource for tracking model trustworthiness.
Object Detection Leaderboard on Hugging Face
Hugging Face has launched an object detection leaderboard to benchmark and compare models on standard detection tasks. The leaderboard provides a centralized evaluation platform for tracking progress in object detection across the community. The launch follows Hugging Face's pattern of expanding its evaluation infrastructure into specific ML subdomains.
Bringing the Artificial Analysis LLM Performance Leaderboard to Hugging Face
Hugging Face is hosting the Artificial Analysis LLM Performance Leaderboard, which tracks inference performance metrics such as latency, throughput, and cost across multiple LLM providers. The leaderboard provides a standardized comparison of how different models perform in production deployment contexts, rather than measuring capability alone. This collaboration brings infrastructure and deployment performance data into the Hugging Face ecosystem.
Community Evals: Because we're done trusting black-box leaderboards over the community
Hugging Face introduces Community Evals, a framework aimed at replacing or supplementing opaque black-box leaderboards with community-driven model evaluations. The initiative reflects growing skepticism about the reliability and transparency of existing benchmark leaderboards. By crowdsourcing evaluations, Hugging Face seeks to make model assessment more transparent, diverse, and resistant to gaming. This represents a structural shift in how the open-source AI community approaches model comparison and trust.
Launching the Artificial Analysis Text to Image Leaderboard & Arena
Hugging Face and Artificial Analysis are launching a combined leaderboard and arena for evaluating text-to-image models. The leaderboard tracks quality, speed, and cost metrics across leading image generation models, while the arena component collects human preference votes for side-by-side comparisons. This provides a structured benchmark for comparing commercial and open-weight image generation systems.
Introducing the LiveCodeBench Leaderboard - Holistic and Contamination-Free Evaluation of Code LLMs
Hugging Face introduces a leaderboard based on LiveCodeBench, a benchmark designed for holistic and contamination-free evaluation of code-generating large language models. The benchmark continuously collects new coding problems from competitive programming platforms to prevent data contamination that plagues static benchmarks. It evaluates models across multiple code-related tasks beyond just code generation, aiming to provide a more reliable signal of true model capability.
Introducing the Open Leaderboard for Hebrew LLMs
Hugging Face has launched an open leaderboard dedicated to evaluating large language models on Hebrew language tasks. The leaderboard aims to benchmark multilingual and Hebrew-specific models across standardized tasks to track progress in Hebrew NLP. This fills a gap in non-English language evaluation infrastructure.
Adding Benchmaxxer Repellant to the Open ASR Leaderboard
Hugging Face describes measures taken to prevent benchmark gaming ('benchmaxxing') on the Open ASR Leaderboard by introducing private or held-out evaluation data. The post addresses the integrity of automatic speech recognition benchmarks, where models may be overfitted or tuned specifically to public test sets. This is part of a broader effort to maintain meaningful leaderboard rankings as ASR model submissions increase.


