NPHardEval Leaderboard: Benchmarking LLM Reasoning via Computational Complexity Classes
The NPHardEval leaderboard evaluates large language models on reasoning tasks grouped by computational complexity class (P, NP-complete, and NP-hard), providing a structured framework for assessing algorithmic reasoning. The benchmark refreshes its problem instances over time to mitigate data contamination, a persistent weakness of static benchmarks. Results are hosted on Hugging Face and aim to reveal systematic differences in how frontier models handle problems of varying computational hardness.
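As a rough illustration of why regenerated instances resist contamination, the sketch below builds a fresh shortest-path instance (a P-class problem) from a seed and grades a claimed answer against a computed ground truth. It is not NPHardEval's actual generator: the function names, graph sizes, and weight ranges are hypothetical choices made for this example.

```python
import heapq
import random


def make_instance(seed, n_nodes=6, extra_edges=4, max_weight=9):
    """Build a random connected, weighted, undirected graph from a seed.

    Hypothetical generator: a new seed yields a new instance, so answers
    memorized from an older release do not carry over.
    """
    rng = random.Random(seed)
    edges = {}
    # A chain through a shuffled node order guarantees connectivity.
    order = list(range(n_nodes))
    rng.shuffle(order)
    for u, v in zip(order, order[1:]):
        edges[(min(u, v), max(u, v))] = rng.randint(1, max_weight)
    # Add a few extra edges; keep the existing weight if the pair repeats.
    for _ in range(extra_edges):
        u, v = rng.sample(range(n_nodes), 2)
        edges.setdefault((min(u, v), max(u, v)), rng.randint(1, max_weight))
    return edges


def shortest_distance(edges, source, target):
    """Ground truth via Dijkstra's algorithm; returns None if unreachable."""
    adj = {}
    for (u, v), w in edges.items():
        adj.setdefault(u, []).append((v, w))
        adj.setdefault(v, []).append((u, w))
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == target:
            return d
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in adj.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return None


def grade(edges, source, target, claimed_distance):
    """Return True when the model's claimed distance matches the ground truth."""
    return claimed_distance == shortest_distance(edges, source, target)
```

Because the ground truth is recomputed for each seed, the grader needs no stored answer key, and a new release only has to change the seeds.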
Related events (8)
Learning to Reason with LLMs
OpenAI's blog post of September 12, 2024 describes advances in training large language models to perform complex multi-step reasoning. It accompanies the release of the o1 model series (reportedly codenamed 'Strawberry'), which uses chain-of-thought reasoning trained via reinforcement learning and reports significantly improved performance on math, science, and coding benchmarks.
Rethinking LLM Evaluation with 3C3H: AraGen Benchmark and Leaderboard
Hugging Face introduces AraGen, a new Arabic-language LLM benchmark and leaderboard built around the 3C3H evaluation framework (Correctness, Completeness, Conciseness, Helpfulness, Harmlessness, Honesty). The benchmark targets a gap in non-English LLM evaluation, specifically for Arabic, using a structured multi-criteria rubric rather than simple accuracy metrics. The leaderboard is hosted on Hugging Face and aims to provide a more holistic assessment of Arabic generative capabilities across frontier and open-weight models.
An Introduction to AI Secure LLM Safety Leaderboard
Hugging Face introduces the DecodingTrust-based LLM Safety Leaderboard, a benchmark framework for evaluating large language models across multiple safety and trustworthiness dimensions. The leaderboard aims to provide standardized, reproducible safety assessments covering areas such as toxicity, stereotype bias, adversarial robustness, and privacy. It offers a public ranking of models to help researchers and practitioners compare safety properties across different LLMs.
Benchmarking study finds LLMs fail at counterintuitive probability problems despite strong standard performance
A new arXiv paper evaluates 8 state-of-the-art LLMs on discrete probability problems using two datasets: standard exercises (average accuracy 0.96) and counterintuitive exercises designed to trigger heuristic reasoning (average accuracy 0.59). The authors document token bias causing 20%+ performance drops when canonical problem formulations are disguised, and up to 34% degradation when misleading suggestions are embedded in prompts. The findings argue that current LLMs are not genuine probabilistic reasoners despite their success on advanced math benchmarks.
Resolution Diagnostics for Paired LLM Evaluation: Many Leaderboard Rankings Statistically Unresolved
This paper frames pairwise LLM evaluation as a hypothesis-testing problem and introduces a resolution ratio q = N/N* to diagnose whether leaderboard comparisons are statistically powered. Applying this to two public leaderboards, the authors find that 11/40 Open LLM Leaderboard v1 pairwise comparisons and 4-6/9 MMLU-Pro top-10 adjacent-rank pairs fail to meet conventional (alpha=0.05, power=0.8) resolution targets. A key finding is that the widely-used unpaired Cohen-h shortcut underestimates required sample size by approximately a factor of two in close-comparison regimes, a flaw silently inherited by three major statistical calculators. The unresolved-pair pattern persists under multiplicity correction and sequential testing.
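The resolution ratio can be made concrete with a short sketch. It assumes N is the number of benchmark items both models were scored on and N* is the number of items needed to detect the observed gap at the chosen alpha and power; the paper's exact definition of N* may differ. For N*, the code uses a standard normal-approximation sample-size formula for McNemar's paired test, chosen here for illustration rather than taken from the paper. The discordance rates in the usage lines are made-up values.

```python
from math import sqrt
from statistics import NormalDist


def z(p):
    """Standard normal quantile."""
    return NormalDist().inv_cdf(p)


def n_star_paired(p10, p01, alpha=0.05, power=0.8):
    """Approximate item count for a two-sided paired (McNemar-style) test.

    p10: share of items model A gets right and model B gets wrong.
    p01: share of items model B gets right and model A gets wrong.
    Assumption: this approximation stands in for the paper's N*.
    """
    psi = p10 + p01          # total discordance rate
    delta = abs(p10 - p01)   # accuracy gap between the two models
    if delta == 0:
        raise ValueError("No accuracy gap: required sample size is unbounded.")
    numerator = z(1 - alpha / 2) * sqrt(psi) + z(power) * sqrt(psi - delta ** 2)
    return (numerator / delta) ** 2


def resolution_ratio(n_items, n_star):
    """q = N / N*; q >= 1 suggests the comparison is adequately powered."""
    return n_items / n_star


# Hypothetical pair of models: 10% vs. 6% discordant items, a 4-point gap.
required = n_star_paired(p10=0.10, p01=0.06)
q_small = resolution_ratio(500, required)    # benchmark with 500 items
q_large = resolution_ratio(1000, required)   # benchmark with 1000 items
```

Under these assumed rates the 500-item comparison falls below q = 1 while the 1000-item comparison clears it, which is the sense in which a ranking on the smaller benchmark would count as unresolved.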
Tracing the Emergence of Human-Like Pragmatic Reasoning in LLMs Across Languages
Researchers conducted a population-matching experiment evaluating 25 LLMs on conditional inference tasks across four languages, comparing model behavior to matched human populations. The study finds that LLMs function as accurate semantic operators but systematically fail to capture pragmatic enrichments—context-sensitive inferences beyond literal logical meaning—that humans apply effortlessly. Model performance on pragmatic reasoning is not predicted by open vs. closed weights, training orientation, or architecture type, suggesting pragmatic reasoning remains an emergent and unreliable capability. The findings contribute to ongoing debates about whether LLMs reason like humans or merely approximate surface-level linguistic patterns.
BigCodeBench: The Next Generation of HumanEval
Hugging Face introduces BigCodeBench, a new code generation benchmark designed to succeed HumanEval by offering more challenging and diverse programming tasks. The benchmark aims to better evaluate LLMs on real-world coding scenarios involving complex function calls and library usage. A leaderboard accompanies the release to track model performance across the community.
GIM: A Grounded Integration Measure Benchmark for Evaluating Multi-Domain Cognitive Coordination in LLMs
The Grounded Integration Measure (GIM) is a new LLM benchmark of 820 original problems designed to resist benchmark saturation by requiring integration of multiple cognitive operations—constraint satisfaction, state tracking, epistemic vigilance, audience calibration—over broadly accessible knowledge. Unlike knowledge-escalation benchmarks (GPQA, HLE) or pure abstraction benchmarks (ARC-AGI), GIM grounds reasoning in realistic tasks without gating on specialized expertise. The authors calibrate a 2-parameter logistic IRT model over 200k+ prompt-response pairs across 28 models and 47 test configurations, producing the most extensive published study of test-time compute vs. model capability tradeoffs on a fixed benchmark. A key finding is that within-family configuration choices (thinking budget, quantization) matter as much as model selection.
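For readers unfamiliar with item response theory, the standard two-parameter logistic (2PL) model gives the probability that a respondent with ability theta answers an item correctly, where the item has discrimination a and difficulty b. The sketch below shows that function and the log-likelihood a calibration would maximize. It is a generic illustration, not the authors' calibration code, and the item parameters and responses are hypothetical.

```python
import math


def p_correct(theta, a, b):
    """2PL item response function: sigmoid(a * (theta - b))."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def log_likelihood(theta, items, responses):
    """Log-likelihood of binary responses for one respondent.

    items: list of (a, b) pairs, one per item.
    responses: list of 0/1 outcomes in the same order.
    """
    total = 0.0
    for (a, b), y in zip(items, responses):
        p = p_correct(theta, a, b)
        total += math.log(p) if y == 1 else math.log(1.0 - p)
    return total


# Hypothetical items: (discrimination, difficulty).
items = [(1.5, -1.0), (1.0, 0.0), (2.0, 1.0)]
responses = [1, 1, 0]

# A higher-ability respondent is more likely to solve the hardest item.
weak = p_correct(theta=-0.5, a=2.0, b=1.0)
strong = p_correct(theta=1.5, a=2.0, b=1.0)
ll = log_likelihood(0.5, items, responses)
```

The discrimination parameter controls how sharply an item separates respondents near its difficulty, which is what lets a fitted model place different configurations of the same model family on a common ability scale.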


