Almanac
5 · Hugging Face Blog · 1mo ago

Fixing Open LLM Leaderboard with Math-Verify

Hugging Face introduces Math-Verify, a tool designed to address evaluation reliability issues in the Open LLM Leaderboard by improving mathematical answer verification. The post describes problems with existing string-matching approaches that lead to inaccurate benchmark scores for math tasks. Math-Verify aims to provide more robust symbolic and numerical answer checking to reduce false positives and negatives in leaderboard evaluations.
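The gap between string matching and value-based checking can be illustrated with a minimal sketch. It is not Math-Verify's implementation or API: the helper names `string_match` and `value_match` are hypothetical, and the sketch only handles plain decimals and fractions using Python's standard library.

```python
from fractions import Fraction


def string_match(pred: str, gold: str) -> bool:
    """Exact string comparison after trimming whitespace."""
    return pred.strip() == gold.strip()


def value_match(pred: str, gold: str) -> bool:
    """Compare as exact rationals when both parse; otherwise fall back to strings.

    Hypothetical helper for illustration only; a real verifier would also
    need to handle LaTeX, symbolic expressions, sets, intervals, and so on.
    """
    try:
        return Fraction(pred.strip()) == Fraction(gold.strip())
    except (ValueError, ZeroDivisionError):
        # Not a plain number (or a zero denominator): compare the raw text.
        return string_match(pred, gold)


if __name__ == "__main__":
    for pred, gold in [("0.5", "1/2"), ("2", "2.0"), ("0.3", "1/3")]:
        print(pred, gold, string_match(pred, gold), value_match(pred, gold))
```

In this sketch, "0.5" and "1/2" fail the string comparison but pass the value comparison, which is the kind of false negative the post attributes to string matching.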

Related events (8)

6 · Hugging Face Blog · 1mo ago

What's going on with the Open LLM Leaderboard?

Hugging Face published a commentary examining anomalies and issues observed in the Open LLM Leaderboard, focusing on MMLU benchmark results. The post investigates potential data contamination, evaluation inconsistencies, and scoring discrepancies across open-weight models. It raises concerns about the reliability of MMLU as a benchmark signal and the integrity of leaderboard rankings.

4 · Hugging Face Blog · 1mo ago

Introducing the Open FinLLM Leaderboard

Hugging Face has launched the Open FinLLM Leaderboard, a benchmarking platform specifically designed to evaluate large language models on financial domain tasks. The leaderboard aims to provide standardized, open evaluation of LLMs across finance-specific capabilities such as financial reasoning, document understanding, and numerical analysis. This fills a gap in domain-specific evaluation infrastructure for the financial sector.

5 · Hugging Face Blog · 1mo ago

An Introduction to AI Secure LLM Safety Leaderboard

Hugging Face introduces the DecodingTrust-based LLM Safety Leaderboard, a benchmark framework for evaluating large language models across multiple safety and trustworthiness dimensions. The leaderboard aims to provide standardized, reproducible safety assessments covering areas such as toxicity, stereotype bias, adversarial robustness, and privacy. It offers a public ranking of models to help researchers and practitioners compare safety properties across different LLMs.

6 · arXiv · cs.CL · 22d ago

Resolution Diagnostics for Paired LLM Evaluation: Many Leaderboard Rankings Statistically Unresolved

This paper frames pairwise LLM evaluation as a hypothesis-testing problem and introduces a resolution ratio q = N/N* to diagnose whether leaderboard comparisons are statistically powered. Applying this to two public leaderboards, the authors find that 11 of 40 Open LLM Leaderboard v1 pairwise comparisons and 4 to 6 of 9 MMLU-Pro top-10 adjacent-rank pairs fail to meet conventional resolution targets (alpha = 0.05, power = 0.8). A key finding is that the widely used unpaired Cohen's h shortcut underestimates the required sample size by approximately a factor of two in close-comparison regimes, a flaw silently inherited by three major statistical calculators. The unresolved-pair pattern persists under multiplicity correction and sequential testing.
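As a rough illustration of the quantities involved, the sketch below computes the textbook unpaired Cohen's h sample-size approximation and a ratio of the form q = N/N*. It rests on assumptions not confirmed by the summary: that N* is the sample size required to reach the target and that N is the number of benchmark items available. The function names are hypothetical, and the unpaired formula is the shortcut the paper reports as too optimistic, not the paper's own paired calculation.

```python
import math
from statistics import NormalDist


def cohen_h(p1: float, p2: float) -> float:
    """Cohen's h effect size between two proportions (arcsine transform)."""
    return abs(2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2)))


def required_n_unpaired(p1: float, p2: float,
                        alpha: float = 0.05, power: float = 0.8) -> int:
    """Textbook per-group sample size for a two-sided, two-sample test.

    Uses n = 2 * ((z_{1-alpha/2} + z_{power}) / h)^2.
    Raises ZeroDivisionError when p1 == p2, since h is then zero.
    """
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    return math.ceil(2 * (z / cohen_h(p1, p2)) ** 2)


def resolution_ratio(n_available: int, n_required: int) -> float:
    """q = N / N*; under the assumption above, q < 1 suggests an underpowered pair."""
    return n_available / n_required


if __name__ == "__main__":
    # Hypothetical accuracies for two closely ranked models.
    n_star = required_n_unpaired(0.70, 0.72)
    print(n_star, resolution_ratio(1000, n_star))
```

Under these assumptions, two models a couple of accuracy points apart need several thousand items per comparison, so a benchmark with far fewer items would give q well below 1.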

4 · Hugging Face Blog · 1mo ago

Open LLM Leaderboard: DROP Deep Dive

Hugging Face published a detailed analysis of the DROP benchmark as used in the Open LLM Leaderboard, examining how models are evaluated on this reading comprehension and numerical reasoning task. The post investigates scoring methodology, potential issues with evaluation consistency, and what DROP results actually reveal about model capabilities. This is part of ongoing efforts to improve transparency and reliability of the Open LLM Leaderboard.

8 · arXiv · cs.AI · 29d ago

Large-Scale Evaluation of LLM-Driven Formal Proof Search on Open Mathematical Problems

Researchers present the first large-scale evaluation of LLM-based formal proof search on genuinely open mathematical problems, using Lean as a verification backend. Their most capable agent autonomously resolved 9 of 353 open Erdős problems and proved 44 of 492 OEIS conjectures, at a cost of a few hundred dollars per problem. The system is already being deployed in active research across combinatorics, optimization, graph theory, algebraic geometry, and quantum optics. The study also compares agent architectures, finding that more sophisticated designs outperform simple generate-and-verify loops on the hardest problems.

4 · Hugging Face Blog · 1mo ago

Bringing the Artificial Analysis LLM Performance Leaderboard to Hugging Face

Hugging Face is hosting the Artificial Analysis LLM Performance Leaderboard, which tracks inference performance metrics such as latency, throughput, and cost across multiple LLM providers. The leaderboard provides a standardized comparison of how different models perform in production deployment contexts rather than purely capability benchmarks. This collaboration brings infrastructure and deployment performance data into the Hugging Face ecosystem.

4 · Hugging Face Blog · 1mo ago

Introducing the Open Leaderboard for Hebrew LLMs

Hugging Face has launched an open leaderboard dedicated to evaluating large language models on Hebrew language tasks. The leaderboard aims to benchmark multilingual and Hebrew-specific models across standardized tasks to track progress in Hebrew NLP. This fills a gap in non-English language evaluation infrastructure.