5 · Hugging Face Blog · 1mo ago

Community Evals: Because we're done trusting black-box leaderboards over the community

Hugging Face introduces Community Evals, a framework aimed at replacing or supplementing black-box leaderboards with community-driven model evaluations. The initiative reflects growing skepticism about the reliability and transparency of existing benchmark leaderboards. By crowdsourcing evaluations, Hugging Face seeks to make model assessment more transparent, diverse, and resistant to gaming. This represents a structural shift in how the open-source AI community compares models and decides which results to trust.

Evaluation and Benchmarking · Open Weights Progress · Open LLM Leaderboard · Hugging Face · Community Evals

Related guides (3)

Hugging Face

Hugging Face: The Home of Open-Source AI

Open Weights Progress · Topic guide

Open Weights Progress: How Freely Available AI Models Caught Up to the Frontier

Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Related events (8)

5 · Hugging Face Blog · 1mo ago

Announcing Evaluation on the Hub

Hugging Face announced Evaluation on the Hub, a new feature enabling users to evaluate any model on any dataset directly within the Hugging Face Hub infrastructure. The tool aims to lower the barrier to standardized model evaluation by integrating evaluation workflows into the existing model and dataset hosting platform. This represents an infrastructure step toward more accessible and reproducible benchmarking in the ML community.

Evaluation and Benchmarking · Agent and Tool Ecosystem · Evaluation on the Hub · Hugging Face

6 · arXiv · cs.CL · 5d ago

Every Eval Ever: unified schema and community repository for AI evaluation results

Researchers introduce Every Eval Ever, a shared schema and crowdsourced repository designed to standardize AI evaluation results across incompatible formats, frameworks, and sources. The system ingests results from evaluation harnesses, papers, leaderboards, and custom repositories into a single JSON document format, with optional per-instance output storage. The repository, hosted on Hugging Face, currently covers 22,235 models, 2,273 unique benchmarks, and 31 evaluation formats. The work addresses a persistent infrastructure problem in AI evaluation science: divergent scores for nominally identical evaluations and scattered, incomparable metadata.
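To make the idea of a single shared record format concrete, here is a minimal Python sketch of normalizing results from two differently shaped sources into one JSON document. It is an illustration only: the `EvalRecord` fields, the two input layouts, and the helper names are all hypothetical and are not taken from the actual Every Eval Ever schema.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class EvalRecord:
    # Hypothetical unified record; field names are assumptions for illustration.
    model: str
    benchmark: str
    metric: str
    score: float                      # stored on a 0-1 scale in this sketch
    source_format: str                # which kind of source the result came from
    instances: Optional[list] = None  # optional per-instance outputs


def from_harness(raw: dict) -> EvalRecord:
    """Convert a made-up harness-style result with a {metric: value} mapping."""
    metric, score = next(iter(raw["results"].items()))
    return EvalRecord(
        model=raw["model_name"],
        benchmark=raw["task"],
        metric=metric,
        score=float(score),
        source_format="harness",
        instances=raw.get("samples"),
    )


def from_leaderboard_row(row: dict) -> EvalRecord:
    """Convert a made-up leaderboard row that reports accuracy as a percentage."""
    return EvalRecord(
        model=row["Model"],
        benchmark=row["Benchmark"],
        metric="accuracy",
        score=float(row["Score (%)"]) / 100.0,
        source_format="leaderboard",
    )


def to_json(record: EvalRecord) -> str:
    """Serialize a record to one JSON document with stable key order."""
    return json.dumps(asdict(record), sort_keys=True)


harness = from_harness(
    {"model_name": "model-a", "task": "bench-x", "results": {"accuracy": 0.72}}
)
board = from_leaderboard_row(
    {"Model": "model-a", "Benchmark": "bench-x", "Score (%)": "71.5"}
)
print(to_json(harness))
print(to_json(board))
```

Once both results share one shape, the two scores for the nominally identical evaluation can be compared directly, and `source_format` records where each number came from.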

Evaluation and Benchmarking · Agent and Tool Ecosystem · Hugging Face · Every Eval Ever

5 · Hugging Face Blog · 1mo ago

BigCodeBench: The Next Generation of HumanEval

Hugging Face introduces BigCodeBench, a new code generation benchmark designed to succeed HumanEval by offering more challenging and diverse programming tasks. The benchmark aims to better evaluate LLMs on real-world coding scenarios involving complex function calls and library usage. A leaderboard accompanies the release to track model performance across the community.

Evaluation and Benchmarking · Agent and Tool Ecosystem · BigCodeBench · Hugging Face · HumanEval

5 · Hugging Face Blog · 1mo ago

Introducing the LiveCodeBench Leaderboard - Holistic and Contamination-Free Evaluation of Code LLMs

Hugging Face introduces a leaderboard based on LiveCodeBench, a benchmark designed for holistic and contamination-free evaluation of code-generating large language models. The benchmark continuously collects new coding problems from competitive programming platforms to prevent data contamination that plagues static benchmarks. It evaluates models across multiple code-related tasks beyond just code generation, aiming to provide a more reliable signal of true model capability.

Evaluation and Benchmarking · Agent and Tool Ecosystem · LiveCodeBench · Hugging Face · LiveCodeBench Leaderboard

5 · Hugging Face Blog · 1mo ago

Introducing the Red-Teaming Resistance Leaderboard

Hugging Face and Haize Labs have launched a Red-Teaming Resistance Leaderboard to systematically benchmark how well AI models resist adversarial prompting and jailbreak attempts. The leaderboard provides a standardized evaluation framework for comparing model robustness against red-teaming attacks. This fills a gap in the evaluation ecosystem where safety and adversarial robustness metrics have been less formalized than capability benchmarks.

Evaluation and Benchmarking · AI Safety Research · Haize Labs · Hugging Face · Red-Teaming Resistance Leaderboard

3 · Hugging Face Blog · 1mo ago

Guide to Setting Up a Hugging Face Leaderboard: Vectara Hallucination Leaderboard as Example

This Hugging Face blog post provides an end-to-end tutorial on creating custom leaderboards on the Hugging Face platform, using Vectara's hallucination leaderboard as a concrete example. It covers the technical setup for hosting evaluation leaderboards, which are increasingly important infrastructure for tracking model capabilities. The post bridges tooling and evaluation concerns by showing how third-party organizations can publish standardized benchmarks on Hugging Face.

Evaluation and Benchmarking · Agent and Tool Ecosystem · Vectara · Hugging Face · Hugging Face Leaderboard · +1 more

4 · Hugging Face Blog · 1mo ago

Object Detection Leaderboard on Hugging Face

Hugging Face has launched an object detection leaderboard to benchmark and compare models on standard detection tasks. The leaderboard provides a centralized evaluation platform for tracking progress in object detection across the community. This follows the pattern of Hugging Face expanding its evaluation infrastructure for specific ML subdomains.

Evaluation and Benchmarking · Hugging Face · Object Detection Leaderboard

6 · arXiv · cs.AI · 11d ago

EvalCards: A unified reporting layer for AI evaluation results with interpretive signals

Researchers introduce EvalCards, an operational schema and tooling layer that composes benchmark metadata, evaluation run data, and model metadata into a unified, interpretable record for AI evaluation reporting. The system derives a reporting schema from 52 papers and 10 stakeholder interviews, implements four interpretive signals (reproducibility, documentation completeness, provenance/risk, score comparability), and deploys a monitoring tool across 5,816 models, 635 benchmarks, and 101,843 results. The work targets the widespread inconsistency in how evaluation results are reported across leaderboards, model cards, and company blogs, making cross-source comparison unreliable. It addresses a structural gap in the evaluation ecosystem by providing extraction infrastructure, not just a proposal.

Evaluation and Benchmarking · AI Safety Research · Evaluation Cards: An Interpretive Layer for AI Evaluation Reporting · EvalCards