HELMET

benchmark · active · helmet-e66a0908 · 1 event · first seen 28d ago

Aliases: HELMET

Recent events (1)

5 · Hugging Face Blog · 28d ago

Introducing HELMET: Holistically Evaluating Long-context Language Models

HELMET is a new benchmark that holistically evaluates long-context language models on diverse real-world tasks rather than on synthetic needle-in-a-haystack tests. It covers multiple task categories, including retrieval, reasoning, summarization, and code, and aims to provide a more reliable and comprehensive assessment of long-context capabilities. It was introduced on the Hugging Face blog, which suggests an open release with associated tooling for the community.