Introducing HELMET: Holistically Evaluating Long-context Language Models
HELMET is a new benchmark designed to holistically evaluate long-context language models across diverse real-world tasks rather than synthetic needle-in-a-haystack tests. The benchmark covers multiple task categories including retrieval, reasoning, summarization, and code, aiming to provide more reliable and comprehensive assessment of long-context capabilities. It is introduced via the Hugging Face blog, suggesting an open release with associated tooling for the community.
Related guides (3)
Related events (8)
Very Large Language Models and How to Evaluate Them
This Hugging Face blog post from October 2022 discusses approaches to zero-shot evaluation of large language models hosted on the Hub. It covers methodologies for benchmarking LLMs without task-specific fine-tuning, addressing the practical challenges of evaluating very large models at scale. The post situates evaluation tooling within the broader ecosystem of open model hosting and assessment.
CyberSecEval 2 - A Comprehensive Evaluation Framework for Cybersecurity Risks and Capabilities of Large Language Models
CyberSecEval 2 is a benchmark framework designed to evaluate both the cybersecurity risks and capabilities of large language models. The framework appears to be hosted or featured on Hugging Face's leaderboard infrastructure, extending prior cybersecurity evaluation work. It assesses LLMs across multiple dimensions of security-relevant behavior, including potential for misuse and defensive capabilities.
Introducing ConTextual: Benchmark for Joint Text-Image Reasoning in Text-Rich Scenes
Hugging Face introduces ConTextual, a new benchmark evaluating multimodal models on their ability to jointly reason over text and images in text-rich scenes. The benchmark targets a specific capability gap where models must integrate visual and textual information simultaneously rather than treating them independently. A leaderboard accompanies the benchmark to track model progress on this task.
HydraHead: Head-level hybridization of full and linear attention for long-context efficiency
Researchers introduce HydraHead, an architecture that hybridizes Full Attention (FA) and Linear Attention (LA) at the head level rather than the conventional layer level, motivated by interpretability findings showing functional heterogeneity among heads within the same layer. An interpretability-driven selection strategy preserves FA only for retrieval-critical heads, achieving a 7:1 LA-to-FA ratio while matching the long-context performance of a 3:1 layer-wise hybrid. Trained on only 15B tokens, HydraHead achieves over 69% improvement over the baseline at 512K context length, approaching Qwen3.5's performance despite that model having a native 256K context window. The work suggests head-level hybridization is a significantly underexplored and high-potential design axis for efficient long-context models.
BigCodeBench: The Next Generation of HumanEval
Hugging Face introduces BigCodeBench, a new code generation benchmark designed to succeed HumanEval by offering more challenging and diverse programming tasks. The benchmark aims to better evaluate LLMs on real-world coding scenarios involving complex function calls and library usage. A leaderboard accompanies the release to track model performance across the community.
SmolLM3: Hugging Face Releases Small Multilingual Long-Context Reasoning Model
Hugging Face has released SmolLM3, a compact language model designed for multilingual support, long-context processing, and reasoning capabilities. The model targets the small/efficient model segment while incorporating reasoning features typically associated with larger models. This release continues Hugging Face's SmolLM series aimed at capable but deployable open-weight models.
Introducing the LiveCodeBench Leaderboard - Holistic and Contamination-Free Evaluation of Code LLMs
Hugging Face introduces a leaderboard based on LiveCodeBench, a benchmark designed for holistic and contamination-free evaluation of code-generating large language models. The benchmark continuously collects new coding problems from competitive programming platforms to prevent data contamination that plagues static benchmarks. It evaluates models across multiple code-related tasks beyond just code generation, aiming to provide a more reliable signal of true model capability.
LongMINT: Benchmark for Evaluating Memory Under Multi-Target Interference in Long-Horizon Agent Systems
LongMINT is a new benchmark designed to evaluate memory-augmented agents in realistic long-horizon settings where information is repeatedly updated and interferes across memories. It contains 15.6k QA pairs over contexts averaging 138.8k tokens (up to 1.8M tokens), spanning domains including state tracking, multi-turn dialogue, Wikipedia revisions, and GitHub commits. Evaluation of 7 representative systems—including vanilla long-context LLMs, RAG, and memory-augmented agent frameworks—reveals consistently low average accuracy of 27.9%, with performance particularly degraded on multi-target aggregation tasks and when earlier facts are revised by subsequent context. The analysis identifies retrieval and memory construction as the primary bottlenecks.


