benchmark
HELMET
benchmark · active
helmet-e66a0908 · 1 event · first seen 28d ago · Aliases: HELMET
Recent events (1)
Introducing HELMET: Holistically Evaluating Long-context Language Models
HELMET is a benchmark for holistically evaluating long-context language models on diverse real-world tasks, rather than relying only on synthetic needle-in-a-haystack tests. It spans several task categories, including retrieval, reasoning, summarization, and code, with the aim of giving a more reliable and comprehensive assessment of long-context capability. It was introduced on the Hugging Face blog, which suggests an open release with accompanying tooling for the community.
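
To make "holistic" concrete, here is a minimal sketch of averaging per-category results into one number and flagging the weakest category. It is not HELMET's scoring code: the function names and every score below are made up for illustration, and equal weighting of categories is an assumption.

```python
from statistics import mean


def holistic_score(category_scores):
    """Macro-average: every task category counts equally (an assumption)."""
    if not category_scores:
        raise ValueError("need at least one category score")
    return mean(category_scores.values())


def weakest_category(category_scores):
    """Return the name of the lowest-scoring category."""
    return min(category_scores, key=category_scores.get)


# Hypothetical scores on a 0-100 scale, invented for illustration only.
scores = {
    "retrieval": 90.0,
    "reasoning": 50.0,
    "summarization": 40.0,
    "code": 60.0,
}

overall = holistic_score(scores)
weakest = weakest_category(scores)
print(f"overall={overall:.1f}, weakest={weakest}")
```

In this made-up example, a model that does well on retrieval alone still gets a modest overall score, which is the kind of gap a single needle-in-a-haystack test would not reveal.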