Needle-in-a-Haystack

benchmark · active · needle-in-a-haystack-4eda8ddf · 4 events · first seen Jun 1, 2026

Aliases: Needle-in-a-Haystack, Needle In A Haystack

More like this (12)

Visual Haystacks · Perplexity Search · Good Token Hunting · hybrid dense-sparse retrieval · LocateAnything · Information Bottleneck · K-Search · likelihood approximation · indirect prompt injection · peg-in-hole insertion task · Perplexity Computer · Grasshopper Problem (IMO 2009 P6)

Recent events (4)

5 · arXiv · cs.AI · Jul 8, 2026

DepthWeave-KV: Token-adaptive cross-layer KV cache compression for long-context inference

A new arXiv preprint introduces DepthWeave-KV, a KV cache compression method that factorizes key-value states across neighboring transformer layers using shared low-rank channel bases while retaining token-specific residuals for attention-sensitive positions. A token-conditional depth router allocates higher reconstruction rank to instruction-bearing and retrieval-critical tokens, with calibration-free online error tracking during generation. The method achieves 8.3x KV memory reduction at 64K context while maintaining near-full-cache quality on LongBench, Needle-in-a-Haystack, and L-Eval benchmarks. The work addresses a practical bottleneck in long-context inference without requiring base model retraining.

Long Context Evolution · Inference Economics · DepthWeave-KV · L-Eval · Needle-in-a-Haystack · +1 more
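To put the reported 8.3x reduction in concrete terms, the following is a rough back-of-the-envelope sketch of KV cache size at 64K context. The model shape (32 layers, 8 KV heads, head dimension 128, 16-bit values) is a hypothetical example and is not taken from the preprint; `kv_cache_bytes` is an illustrative helper, and "64K" is assumed to mean 65,536 tokens.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Uncompressed KV cache size for one sequence.

    The leading factor of 2 accounts for storing both keys and values
    at every layer, for every KV head and every token.
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


# Hypothetical model shape, chosen only for illustration (not from the preprint).
full = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=64 * 1024)

# Apply the memory reduction factor reported in the summary above.
compressed = full / 8.3

GIB = 1024 ** 3
print(f"full cache: {full / GIB:.2f} GiB")
print(f"compressed: {compressed / GIB:.2f} GiB")
```

Under these assumed dimensions the full cache is 8 GiB per sequence, so a reduction of this size would bring it to just under 1 GiB. Actual figures depend on the model and on any overhead the method itself stores, such as bases and residuals.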

6 · arXiv · cs.CL · Jun 10, 2026

QK-Restore: Fixing long-context recall degradation caused by CoT fine-tuning in hybrid LLMs

Researchers find that chain-of-thought supervised fine-tuning (CoT-SFT) systematically degrades long-context recall in hybrid linear-attention models (HypeNet, Jet-Nemotron). Needle-in-a-Haystack accuracy collapses; for example, HypeNet-9B drops from 67.2% to 9.4% at 256K context. The paper attributes this to CoT-SFT biasing attention gradients toward short-range patterns, which corrupts the query-key projections responsible for long-range routing. It proposes QK-Restore, a training-free fix that restores only W_Q and W_K from the pre-SFT checkpoint, recovering long-context capability while preserving reasoning gains.

Long Context Evolution · Alignment and RLHF · Jet-Nemotron · Needle-in-a-Haystack · HypeNet · +2 more
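The core idea in the QK-Restore summary, copying only the query and key projection weights back from the pre-SFT checkpoint, can be sketched as a simple checkpoint merge. This is a minimal illustration and not the paper's implementation: `restore_qk` is a hypothetical helper, the `q_proj.weight` / `k_proj.weight` suffixes are an assumed naming convention, and scalars stand in for weight tensors.

```python
from typing import Any, Dict, Tuple


def restore_qk(
    sft_state: Dict[str, Any],
    base_state: Dict[str, Any],
    qk_suffixes: Tuple[str, ...] = ("q_proj.weight", "k_proj.weight"),
) -> Dict[str, Any]:
    """Return a copy of the fine-tuned state in which query/key projections
    come from the pre-SFT checkpoint; every other parameter is kept as-is."""
    restored = dict(sft_state)  # shallow copy; values (e.g. tensors) are shared
    for name in sft_state:
        if name.endswith(qk_suffixes):  # assumed naming convention for W_Q / W_K
            if name not in base_state:
                raise KeyError(f"{name} missing from pre-SFT checkpoint")
            restored[name] = base_state[name]
    return restored


# Toy checkpoints: hypothetical parameter names, scalars standing in for tensors.
base = {
    "layers.0.attn.q_proj.weight": 1.0,
    "layers.0.attn.k_proj.weight": 2.0,
    "layers.0.attn.v_proj.weight": 3.0,
    "layers.0.mlp.up_proj.weight": 4.0,
}
sft = {name: value + 10.0 for name, value in base.items()}

merged = restore_qk(sft, base)
```

In `merged`, the two query/key entries carry the pre-SFT values while the value projection and MLP weights keep their fine-tuned values. With a real framework the same pattern would apply to a loaded state dictionary, with parameter names adjusted to the model at hand.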

9 · Anthropic News · Jun 3, 2026

Anthropic launches Claude 3 model family: Haiku, Sonnet, and Opus

Anthropic announced the Claude 3 model family on March 4, 2024, comprising three models — Haiku, Sonnet, and Opus — in ascending capability order. Claude 3 Opus claims top performance on major benchmarks including MMLU, GPQA, and GSM8K, with near-perfect recall on long-context evaluations (200K context window, 99%+ NIAH accuracy) and new multimodal vision capabilities. The release also highlights reduced unnecessary refusals, a twofold accuracy improvement over Claude 2.1, and Constitutional AI-based safety tuning. Opus and Sonnet launched immediately via claude.ai and the Claude API across 159 countries, with Haiku to follow.

Long Context Evolution · Frontier Model Releases · Claude Opus 4.6 · Constitutional AI · Claude Haiku 4.5 · +8 more

6 · The Batch · Jun 1, 2026

Test-Time Training End-to-End (TTT-E2E) Retrains Model Weights to Handle Long Inputs

Researchers from Astera Institute, Nvidia, Stanford, UC Berkeley, and UC San Diego introduced TTT-E2E, a method that compresses long context into transformer weights by training the model during inference via meta-learning. The approach uses sliding-window attention restricted to 8,000 tokens and updates only the fully connected layers of the last quarter of the network on each 1,000-token chunk at inference time, keeping per-token generation latency roughly constant as context scales to 128,000 tokens. TTT-E2E slightly outperforms vanilla transformers on next-token prediction loss across long contexts and matches efficient architectures like Mamba 2 and Gated DeltaNet on inference speed, but fails dramatically on Needle-in-a-Haystack retrieval beyond 8,000 tokens and incurs substantially higher training latency. The work reframes long-context handling as a training-inference trade-off rather than an architectural design problem.

Training Infrastructure · Long Context Evolution · University of California San Diego · Mamba · Stanford University · +13 more
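The update schedule described in the TTT-E2E summary (one weight update per 1,000-token chunk, applied only to the fully connected layers in the last quarter of the network) can be sketched as a small planning function. This only enumerates which chunks and layers would be involved and performs no training. `ttt_update_plan` is a hypothetical helper, the 24-layer depth is an assumed example, and skipping a trailing partial chunk is an assumption of this sketch.

```python
def ttt_update_plan(context_len, num_layers, chunk_size=1_000):
    """Return (chunk spans, trainable layer indices) for one long input."""
    # Last quarter of the network: the only layers whose fully connected
    # weights would be updated at inference time.
    first_trainable = num_layers - num_layers // 4
    trainable_layers = list(range(first_trainable, num_layers))

    # One weight update per full chunk; a trailing partial chunk is
    # skipped here (an assumption of this sketch).
    chunks = [
        (start, start + chunk_size)
        for start in range(0, context_len - chunk_size + 1, chunk_size)
    ]
    return chunks, trainable_layers


# 128,000 tokens as in the summary; the 24-layer depth is hypothetical.
chunks, layers = ttt_update_plan(context_len=128_000, num_layers=24)
print(len(chunks), "updates over layers", layers)
```

At 128,000 tokens this gives 128 updates, each touching the same fixed subset of layers. That fits the summary's point that the per-token cost stays roughly constant as context grows.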