An Introduction to AI Secure LLM Safety Leaderboard
Hugging Face introduces the DecodingTrust-based LLM Safety Leaderboard, a benchmark framework for evaluating large language models across multiple safety and trustworthiness dimensions. The leaderboard aims to provide standardized, reproducible safety assessments covering areas such as toxicity, stereotype bias, adversarial robustness, and privacy. It offers a public ranking of models to help researchers and practitioners compare safety properties across different LLMs.
Introducing the Open Leaderboard for Japanese LLMs
Hugging Face has launched an open leaderboard specifically for evaluating large language models on Japanese language tasks. The leaderboard aims to provide standardized benchmarking for Japanese LLMs, filling a gap in multilingual evaluation infrastructure. This initiative supports the growing ecosystem of Japanese-language AI development and open evaluation practices.
The Open Medical-LLM Leaderboard: Benchmarking Large Language Models in Healthcare
Hugging Face has launched the Open Medical-LLM Leaderboard, a public benchmark for evaluating large language models on healthcare and medical tasks. The leaderboard aggregates performance across multiple medical question-answering datasets to enable standardized comparison of open-weight models in clinical and biomedical domains. This initiative aims to accelerate progress in medical AI by providing transparent, reproducible evaluation infrastructure.
Introducing the Open FinLLM Leaderboard
Hugging Face has launched the Open FinLLM Leaderboard, a benchmarking platform specifically designed to evaluate large language models on financial domain tasks. The leaderboard aims to provide standardized, open evaluation of LLMs across finance-specific capabilities such as financial reasoning, document understanding, and numerical analysis. This fills a gap in domain-specific evaluation infrastructure for the financial sector.
Introducing the Open Arabic LLM Leaderboard
Hugging Face has launched the Open Arabic LLM Leaderboard, a benchmarking platform specifically designed to evaluate large language models on Arabic language tasks. The leaderboard aims to fill a gap in multilingual evaluation infrastructure by providing standardized assessments for Arabic NLP capabilities. This initiative supports the open-source community in tracking progress on Arabic language understanding and generation.
Judge Arena: Benchmarking LLMs as Evaluators
Hugging Face and Atla have launched Judge Arena, a platform for benchmarking large language models in their role as automated evaluators. The initiative uses an Elo-based ranking system to compare how well different LLMs judge the quality of model outputs, addressing the growing reliance on LLM-as-judge paradigms in evaluation pipelines. This fills a meta-evaluation gap: as LLM judges become standard practice, understanding their relative reliability and biases becomes critical for the field.
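For readers unfamiliar with Elo-style ranking, the sketch below shows the textbook Elo update applied to one pairwise comparison between two judge models. It is only an illustration: the summary above does not state Judge Arena's actual formula, starting ratings, or K-factor, so the function names, the 400-point scale, and `k=32.0` are assumptions taken from the standard Elo formulation.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A is preferred over B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return updated (rating_a, rating_b) after one comparison.

    score_a is 1.0 if judge A was preferred, 0.0 if judge B was preferred,
    and 0.5 for a tie. k is an assumed K-factor, not a documented value.
    """
    exp_a = expected_score(rating_a, rating_b)
    delta = k * (score_a - exp_a)
    # Whatever A gains, B loses, so the sum of ratings is preserved.
    return rating_a + delta, rating_b - delta


# Two judges start level; judge A wins one vote.
new_a, new_b = update_elo(1500.0, 1500.0, score_a=1.0)
print(new_a, new_b)
```

With equal starting ratings the expected score is 0.5, so a single win moves each rating by half the K-factor in opposite directions.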
Bringing the Artificial Analysis LLM Performance Leaderboard to Hugging Face
Hugging Face is hosting the Artificial Analysis LLM Performance Leaderboard, which tracks inference performance metrics such as latency, throughput, and cost across multiple LLM providers. The leaderboard provides a standardized comparison of how different models perform in production deployment, rather than on capability benchmarks alone. This collaboration brings infrastructure and deployment performance data into the Hugging Face ecosystem.
What's going on with the Open LLM Leaderboard?
Hugging Face published a commentary examining anomalies and issues observed in the Open LLM Leaderboard, focusing on MMLU benchmark results. The post investigates potential data contamination, evaluation inconsistencies, and scoring discrepancies across open-weight models. It raises concerns about the reliability of MMLU as a benchmark signal and the integrity of leaderboard rankings.
The Open Arabic LLM Leaderboard 2
Hugging Face has launched the second version of the Open Arabic LLM Leaderboard, a benchmarking platform for evaluating large language models on Arabic language tasks. The updated leaderboard introduces revised evaluation protocols and benchmarks targeting Arabic-specific capabilities. This initiative supports the open research community in tracking progress on NLP for Arabic, a language historically underserved in LLM evaluation infrastructure.


