Almanac
benchmark

NPHardEval

benchmarkactivenphardeval-256b2b28·1 events·first seen 28d ago

Aliases: NPHardEval

Co-occurring entities

More like this (12)

Recent events (1)

5Hugging Face Blog·28d ago·source ↗

NPHardEval Leaderboard: Benchmarking LLM Reasoning via Computational Complexity Classes

The NPHardEval leaderboard evaluates large language models on reasoning tasks drawn from computational complexity classes (P, NP, NP-Hard), providing a structured framework for assessing algorithmic reasoning capabilities. The benchmark uses dynamic problem updates to mitigate data contamination, a persistent challenge in static benchmarks. Results are hosted on Hugging Face and aim to reveal systematic differences in how frontier models handle problems of varying computational hardness.