5Hugging Face Blog·1mo ago

NPHardEval Leaderboard: Benchmarking LLM Reasoning via Computational Complexity Classes

The NPHardEval leaderboard evaluates large language models on reasoning tasks drawn from computational complexity classes (P, NP, NP-Hard), providing a structured framework for assessing algorithmic reasoning capabilities. The benchmark uses dynamic problem updates to mitigate data contamination, a persistent challenge in static benchmarks. Results are hosted on Hugging Face and aim to reveal systematic differences in how frontier models handle problems of varying computational hardness.

Frontier Model Releases Evaluation and Benchmarking Hugging Face NPHardEval NP-Hard

Related guides (3)

Hugging Face

Hugging Face: The Home of Open-Source AI

Read asBeginner In-depth

Frontier Model ReleasesTopic guide

Frontier Model Releases: The Race From Language to Action

Read asBeginner In-depth

Evaluation and BenchmarkingTopic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Read asBeginner In-depth

Related events (8)

9Openai Blog·1mo ago·source ↗

Learning to Reason with LLMs

OpenAI announced a new model or capability focused on reasoning in large language models, published on September 12, 2024. The post, hosted on the OpenAI blog, describes advances in training LLMs to perform complex multi-step reasoning. This likely corresponds to the release of the o1 (formerly 'Strawberry') model series, which uses chain-of-thought reasoning trained via reinforcement learning to achieve significantly improved performance on math, science, and coding benchmarks.

Frontier Model Releases Evaluation and Benchmarking Chain-of-Thought Reasoning Reinforcement Learning OpenAI +3 more

4Hugging Face Blog·1mo ago·source ↗

Rethinking LLM Evaluation with 3C3H: AraGen Benchmark and Leaderboard

Hugging Face introduces AraGen, a new Arabic-language LLM benchmark and leaderboard built around the 3C3H evaluation framework (Correctness, Completeness, Conciseness, Helpfulness, Harmlessness, Honesty). The benchmark targets a gap in non-English LLM evaluation, specifically for Arabic, using a structured multi-criteria rubric rather than simple accuracy metrics. The leaderboard is hosted on Hugging Face and aims to provide a more holistic assessment of Arabic generative capabilities across frontier and open-weight models.

Frontier Model Releases Evaluation and Benchmarking 3C3H AraGen Hugging Face

5Hugging Face Blog·1mo ago·source ↗

An Introduction to AI Secure LLM Safety Leaderboard

Hugging Face introduces the DecodingTrust-based LLM Safety Leaderboard, a benchmark framework for evaluating large language models across multiple safety and trustworthiness dimensions. The leaderboard aims to provide standardized, reproducible safety assessments covering areas such as toxicity, stereotype bias, adversarial robustness, and privacy. It offers a public ranking of models to help researchers and practitioners compare safety properties across different LLMs.

Evaluation and Benchmarking AI Safety Research LLM Safety Leaderboard Hugging Face DecodingTrust

5arXiv · cs.AI·12d ago·source ↗

Benchmarking study finds LLMs fail at counterintuitive probability problems despite strong standard performance

A new arXiv paper evaluates 8 state-of-the-art LLMs on discrete probability problems using two datasets: standard exercises (average accuracy 0.96) and counterintuitive exercises designed to trigger heuristic reasoning (average accuracy 0.59). The authors document token bias causing 20%+ performance drops when canonical problem formulations are disguised, and up to 34% degradation when misleading suggestions are embedded in prompts. The findings argue that current LLMs are not genuine probabilistic reasoners despite their success on advanced math benchmarks.

Evaluation and Benchmarking AI Safety Research How reliable are LLMs when it comes to playing dice?How reliable are LLMs when it comes to playing dice?

6arXiv · cs.CL·22d ago·source ↗

Resolution Diagnostics for Paired LLM Evaluation: Many Leaderboard Rankings Statistically Unresolved

This paper frames pairwise LLM evaluation as a hypothesis-testing problem and introduces a resolution ratio q = N/N* to diagnose whether leaderboard comparisons are statistically powered. Applying this to two public leaderboards, the authors find that 11/40 Open LLM Leaderboard v1 pairwise comparisons and 4-6/9 MMLU-Pro top-10 adjacent-rank pairs fail to meet conventional (alpha=0.05, power=0.8) resolution targets. A key finding is that the widely-used unpaired Cohen-h shortcut underestimates required sample size by approximately a factor of two in close-comparison regimes, a flaw silently inherited by three major statistical calculators. The unresolved-pair pattern persists under multiplicity correction and sequential testing.

Frontier Model Releases Evaluation and Benchmarking Cohen's h G*Power resolution ratio q +3 more

6arXiv · cs.CL·1mo ago·source ↗

Tracing the Emergence of Human-Like Pragmatic Reasoning in LLMs Across Languages

Researchers conducted a population-matching experiment evaluating 25 LLMs on conditional inference tasks across four languages, comparing model behavior to matched human populations. The study finds that LLMs function as accurate semantic operators but systematically fail to capture pragmatic enrichments—context-sensitive inferences beyond literal logical meaning—that humans apply effortlessly. Model performance on pragmatic reasoning is not predicted by open vs. closed weights, training orientation, or architecture type, suggesting pragmatic reasoning remains an emergent and unreliable capability. The findings contribute to ongoing debates about whether LLMs reason like humans or merely approximate surface-level linguistic patterns.

Frontier Model Releases Evaluation and Benchmarking large language models Population-Matching Experiment Pragmatic Reasoning +1 more

5Hugging Face Blog·1mo ago·source ↗

BigCodeBench: The Next Generation of HumanEval

Hugging Face introduces BigCodeBench, a new code generation benchmark designed to succeed HumanEval by offering more challenging and diverse programming tasks. The benchmark aims to better evaluate LLMs on real-world coding scenarios involving complex function calls and library usage. A leaderboard accompanies the release to track model performance across the community.

Evaluation and Benchmarking Agent and Tool Ecosystem BigCodeBench Hugging Face HumanEval

6arXiv · cs.CL·1mo ago·source ↗

GIM: A Grounded Integration Measure Benchmark for Evaluating Multi-Domain Cognitive Coordination in LLMs

The Grounded Integration Measure (GIM) is a new LLM benchmark of 820 original problems designed to resist benchmark saturation by requiring integration of multiple cognitive operations—constraint satisfaction, state tracking, epistemic vigilance, audience calibration—over broadly accessible knowledge. Unlike knowledge-escalation benchmarks (GPQA, HLE) or pure abstraction benchmarks (ARC-AGI), GIM grounds reasoning in realistic tasks without gating on specialized expertise. The authors calibrate a 2-parameter logistic IRT model over 200k+ prompt-response pairs across 28 models and 47 test configurations, producing the most extensive published study of test-time compute vs. model capability tradeoffs on a fixed benchmark. A key finding is that within-family configuration choices (thinking budget, quantization) matter as much as model selection.

Frontier Model Releases Evaluation and Benchmarking 2-Parameter Logistic IRT Model GIM (Grounded Integration Measure)test-time compute +4 more