5 · Hugging Face Blog · 1mo ago

Introducing HELMET: Holistically Evaluating Long-context Language Models

HELMET is a new benchmark designed to holistically evaluate long-context language models across diverse real-world tasks rather than synthetic needle-in-a-haystack tests. The benchmark covers multiple task categories, including retrieval, reasoning, summarization, and code, and aims to provide a more reliable and comprehensive assessment of long-context capabilities. It is introduced via the Hugging Face blog, which suggests an open release with associated tooling for the community.

Long Context Evolution · Evaluation and Benchmarking · HELMET · Hugging Face

Related guides (3)

Hugging Face

Hugging Face: The Home of Open-Source AI

Long Context Evolution · Topic guide

Long Context Evolution: From Bigger Windows to Smarter Memory

Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Related events (8)

4 · Hugging Face Blog · 1mo ago

Very Large Language Models and How to Evaluate Them

This Hugging Face blog post from October 2022 discusses approaches to zero-shot evaluation of large language models hosted on the Hub. It covers methodologies for benchmarking LLMs without task-specific fine-tuning, addressing the practical challenges of evaluating very large models at scale. The post situates evaluation tooling within the broader ecosystem of open model hosting and assessment.

Evaluation and Benchmarking · Open Weights Progress · zero-shot evaluation · Hugging Face

5 · Hugging Face Blog · 1mo ago

CyberSecEval 2 - A Comprehensive Evaluation Framework for Cybersecurity Risks and Capabilities of Large Language Models

CyberSecEval 2 is a benchmark framework designed to evaluate both the cybersecurity risks and capabilities of large language models. The framework appears to be hosted or featured on Hugging Face's leaderboard infrastructure, extending prior cybersecurity evaluation work. It assesses LLMs across multiple dimensions of security-relevant behavior, including potential for misuse and defensive capabilities.

Evaluation and Benchmarking · AI Safety Research · CyberSecEval 2 · LlamaGuard · Hugging Face

5 · Hugging Face Blog · 1mo ago

Introducing ConTextual: Benchmark for Joint Text-Image Reasoning in Text-Rich Scenes

Hugging Face introduces ConTextual, a new benchmark evaluating multimodal models on their ability to jointly reason over text and images in text-rich scenes. The benchmark targets a specific capability gap where models must integrate visual and textual information simultaneously rather than treating them independently. A leaderboard accompanies the benchmark to track model progress on this task.

Evaluation and Benchmarking · Multimodal Progress · Hugging Face · ConTextual

6 · arXiv · cs.CL · 47h ago

HydraHead: Head-level hybridization of full and linear attention for long-context efficiency

Researchers introduce HydraHead, an architecture that hybridizes Full Attention (FA) and Linear Attention (LA) at the head level rather than the conventional layer level, motivated by interpretability findings showing functional heterogeneity among heads within the same layer. An interpretability-driven selection strategy preserves FA only for retrieval-critical heads, achieving a 7:1 LA-to-FA ratio while matching the long-context performance of a 3:1 layer-wise hybrid. Trained on only 15B tokens, HydraHead achieves over 69% improvement over the baseline at 512K context length, approaching Qwen3.5's performance despite that model having a native 256K context window. The work suggests head-level hybridization is a significantly underexplored and high-potential design axis for efficient long-context models.

Long Context Evolution · Inference Economics · HydraHead · Qwen3

5 · Hugging Face Blog · 1mo ago

BigCodeBench: The Next Generation of HumanEval

Hugging Face introduces BigCodeBench, a new code generation benchmark designed to succeed HumanEval by offering more challenging and diverse programming tasks. The benchmark aims to better evaluate LLMs on real-world coding scenarios involving complex function calls and library usage. A leaderboard accompanies the release to track model performance across the community.

Evaluation and Benchmarking · Agent and Tool Ecosystem · BigCodeBench · Hugging Face · HumanEval

5 · Hugging Face Blog · 1mo ago

SmolLM3: Hugging Face Releases Small Multilingual Long-Context Reasoning Model

Hugging Face has released SmolLM3, a compact language model designed for multilingual support, long-context processing, and reasoning capabilities. The model targets the small/efficient model segment while incorporating reasoning features typically associated with larger models. This release continues Hugging Face's SmolLM series aimed at capable but deployable open-weight models.

Long Context Evolution · Frontier Model Releases · SmolLM · Hugging Face · SmolLM3

5 · Hugging Face Blog · 1mo ago

Introducing the LiveCodeBench Leaderboard - Holistic and Contamination-Free Evaluation of Code LLMs

Hugging Face introduces a leaderboard based on LiveCodeBench, a benchmark designed for holistic and contamination-free evaluation of code-generating large language models. The benchmark continuously collects new coding problems from competitive programming platforms to prevent data contamination that plagues static benchmarks. It evaluates models across multiple code-related tasks beyond just code generation, aiming to provide a more reliable signal of true model capability.

Evaluation and Benchmarking · Agent and Tool Ecosystem · LiveCodeBench · Hugging Face · LiveCodeBench Leaderboard

6 · arXiv · cs.CL · 1mo ago

LongMINT: Benchmark for Evaluating Memory Under Multi-Target Interference in Long-Horizon Agent Systems

LongMINT is a new benchmark designed to evaluate memory-augmented agents in realistic long-horizon settings where information is repeatedly updated and interferes across memories. It contains 15.6k QA pairs over contexts averaging 138.8k tokens (up to 1.8M tokens), spanning domains including state tracking, multi-turn dialogue, Wikipedia revisions, and GitHub commits. Evaluation of 7 representative systems (vanilla long-context LLMs, RAG, and memory-augmented agent frameworks among them) reveals a consistently low average accuracy of 27.9%, with performance degrading most on multi-target aggregation tasks and when earlier facts are revised by subsequent context. The analysis identifies retrieval and memory construction as the primary bottlenecks.

Long Context Evolution · Evaluation and Benchmarking · LongMINT · Retrieval-Augmented Generation · long-context LLMs