4 · Hugging Face Blog · 1mo ago

CO₂ Emissions and Model Performance: Insights from the Open LLM Leaderboard

Hugging Face published an analysis correlating CO₂ emissions with model performance across submissions to the Open LLM Leaderboard. The study examines the environmental cost of open-weight model development and inference, exploring efficiency trade-offs between model size, benchmark scores, and carbon footprint. The analysis provides empirical data to help researchers and practitioners evaluate sustainability alongside capability metrics.

Evaluation and Benchmarking · Open Weights Progress · Inference Economics · Open LLM Leaderboard · Hugging Face · CO₂ emissions

Related guides (4)

Hugging Face

Hugging Face: The Home of Open-Source AI

Open Weights Progress · Topic guide

Open Weights Progress: How Freely Available AI Models Caught Up to the Frontier

Inference Economics · Topic guide

Inference Economics: The Cost Structure of Running AI Models in Production

Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: The Shifting Yardstick of AI Capability

Related events (8)

6 · Hugging Face Blog · 1mo ago

What's going on with the Open LLM Leaderboard?

Hugging Face published a commentary examining anomalies and issues observed in the Open LLM Leaderboard, focusing on MMLU benchmark results. The post investigates potential data contamination, evaluation inconsistencies, and scoring discrepancies across open-weight models. It raises concerns about the reliability of MMLU as a benchmark signal and the integrity of leaderboard rankings.

Evaluation and Benchmarking · Open Weights Progress · Open LLM Leaderboard · Hugging Face · MMLU

4 · Hugging Face Blog · 1mo ago

CO2 Emissions and the Hugging Face Hub: Leading the Charge

Hugging Face published a blog post outlining its approach to tracking and reporting carbon emissions for models hosted on the Hub. The initiative aims to surface CO₂ metadata alongside model cards to promote transparency about AI's environmental impact. This represents an early industry effort to standardize emissions reporting as part of model documentation practices.

Enterprise Deployment Patterns · Model Cards · Hugging Face

4 · Hugging Face Blog · 1mo ago

Bringing the Artificial Analysis LLM Performance Leaderboard to Hugging Face

Hugging Face is hosting the Artificial Analysis LLM Performance Leaderboard, which tracks inference performance metrics such as latency, throughput, and cost across multiple LLM providers. The leaderboard provides a standardized comparison of how different models perform in production deployment contexts rather than purely capability benchmarks. This collaboration brings infrastructure and deployment performance data into the Hugging Face ecosystem.

Evaluation and Benchmarking · Inference Economics · Artificial Analysis · Hugging Face · Artificial Analysis LLM Performance Leaderboard · +1 more

6 · Mistral Ai News · 19d ago

Mistral AI Publishes First Comprehensive Lifecycle Analysis of LLM Environmental Footprint

Mistral AI has released what it claims is the first comprehensive lifecycle analysis (LCA) of an AI model, conducted in collaboration with Carbone 4 and the French agency ADEME, covering greenhouse gas emissions, water use, and resource depletion. Key findings: over 18 months of training and usage, Mistral Large 2 accounted for 20.4 ktCO₂e of emissions, 281,000 m³ of water consumption, and 660 kg Sb eq of resource depletion, while a single 400-token Le Chat inference costs 1.14 gCO₂e and 45 mL of water. The study proposes three standardized reporting indicators for the industry and advocates mandatory disclosure of training and inference environmental impacts. Mistral argues that environmental footprint scales roughly linearly with model size, which makes right-sizing model selection important.

Training Infrastructure · Inference Economics · Mistral AI · Hubblo · GHG Protocol Product Standard · +9 more

5 · Hugging Face Blog · 1mo ago

The Open Medical-LLM Leaderboard: Benchmarking Large Language Models in Healthcare

Hugging Face has launched the Open Medical-LLM Leaderboard, a public benchmark for evaluating large language models on healthcare and medical tasks. The leaderboard aggregates performance across multiple medical question-answering datasets to enable standardized comparison of open-weight models in clinical and biomedical domains. This initiative aims to accelerate progress in medical AI by providing transparent, reproducible evaluation infrastructure.

Evaluation and Benchmarking · Open Weights Progress · PubMedQA · Open Medical-LLM Leaderboard · MedMCQA · +3 more

4 · Hugging Face Blog · 1mo ago

Introducing the Open FinLLM Leaderboard

Hugging Face has launched the Open FinLLM Leaderboard, a benchmarking platform specifically designed to evaluate large language models on financial domain tasks. The leaderboard aims to provide standardized, open evaluation of LLMs across finance-specific capabilities such as financial reasoning, document understanding, and numerical analysis. This fills a gap in domain-specific evaluation infrastructure for the financial sector.

Evaluation and Benchmarking · Enterprise Deployment Patterns · FinBench · Open LLM Leaderboard · Hugging Face

4 · Hugging Face Blog · 1mo ago

Open LLM Leaderboard: DROP Deep Dive

Hugging Face published a detailed analysis of the DROP benchmark as used in the Open LLM Leaderboard, examining how models are evaluated on this reading comprehension and numerical reasoning task. The post investigates scoring methodology, potential issues with evaluation consistency, and what DROP results actually reveal about model capabilities. This is part of ongoing efforts to improve transparency and reliability of the Open LLM Leaderboard.

Evaluation and Benchmarking · DROP · Open LLM Leaderboard · Hugging Face

4 · Hugging Face Blog · 1mo ago

Introducing the Open Ko-LLM Leaderboard: Leading the Korean LLM Evaluation Ecosystem

Upstage and Hugging Face have launched the Open Ko-LLM Leaderboard, a public benchmark platform for evaluating large language models specifically on Korean language tasks. The leaderboard aims to standardize Korean LLM evaluation and foster competition among models targeting the Korean-language market. This initiative extends the Open LLM Leaderboard framework to a non-English language context, reflecting growing interest in multilingual and language-specific model evaluation.

Evaluation and Benchmarking · Open Weights Progress · Open LLM Leaderboard · Hugging Face · Upstage