Needle-in-a-Haystack
needle-in-a-haystack-4eda8ddf · 3 events · first seen 15d ago
Aliases: Needle-in-a-Haystack, Needle In A Haystack
Recent events (3)
Test-Time Training End-to-End (TTT-E2E) Retrains Model Weights to Handle Long Inputs
Researchers from Astera Institute, Nvidia, Stanford, UC Berkeley, and UC San Diego introduced TTT-E2E, a method that compresses long context into transformer weights by training the model during inference via meta-learning. The approach uses sliding-window attention restricted to 8,000 tokens and updates only the fully connected layers of the last quarter of the network on each 1,000-token chunk at inference time, keeping per-token generation latency roughly constant as context scales to 128,000 tokens. TTT-E2E slightly outperforms vanilla transformers on next-token prediction loss across long contexts and matches efficient architectures like Mamba 2 and Gated DeltaNet on inference speed, but fails dramatically on Needle-in-a-Haystack retrieval beyond 8,000 tokens and incurs substantially higher training latency. The work reframes long-context handling as a training-inference trade-off rather than an architectural design problem.
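The figures in this summary (8,000-token attention window, 1,000-token update chunks, last quarter of the layers) can be illustrated with a small bookkeeping sketch. This is not the authors' code: the function names are hypothetical, the rounding rule for "last quarter" and the choice to count the current token inside the window are assumptions, and the actual gradient step and meta-learning procedure are omitted entirely.

```python
from typing import List, Tuple


def trainable_layer_indices(num_layers: int) -> List[int]:
    """Indices of the last quarter of the layers.

    Rounding down (with a minimum of one layer) is an assumption made
    for this sketch; the summary only says "the last quarter".
    """
    quarter = max(1, num_layers // 4)
    return list(range(num_layers - quarter, num_layers))


def chunk_spans(context_len: int, chunk_size: int = 1000) -> List[Tuple[int, int]]:
    """Half-open (start, end) token spans, one per inference-time update."""
    return [
        (start, min(start + chunk_size, context_len))
        for start in range(0, context_len, chunk_size)
    ]


def attention_window(position: int, window: int = 8000) -> Tuple[int, int]:
    """Half-open span of positions visible to sliding-window attention.

    Assumes the window includes the current token.
    """
    return (max(0, position - window + 1), position + 1)
```

Under these assumptions, a 128,000-token context yields `len(chunk_spans(128_000)) == 128` weight updates, and a token at position 100,000 attends only to positions 92,001 through 100,000. Anything earlier is reachable only through the updated weights, which fits the reported retrieval failure beyond 8,000 tokens.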
QK-Restore: Fixing long-context recall degradation caused by CoT fine-tuning in hybrid LLMs
Researchers find that chain-of-thought supervised fine-tuning (CoT-SFT) systematically degrades long-context recall in hybrid linear-attention models (HypeNet, Jet-Nemotron), with Needle-In-A-Haystack accuracy collapsing after fine-tuning; for example, HypeNet-9B drops from 67.2% to 9.4% at 256K context. The authors attribute this to CoT-SFT biasing attention gradients toward short-range patterns, which corrupts the query-key projections responsible for long-range routing. The paper proposes QK-Restore, a training-free fix that restores only W_Q and W_K from the pre-SFT checkpoint, recovering long-context capability while preserving reasoning gains.
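The restore step described here amounts to a selective checkpoint merge, which can be sketched as follows. This is a minimal illustration, not the paper's implementation: checkpoints are modeled as plain dictionaries from parameter name to weights, and the name markers `q_proj` and `k_proj` are assumptions about how W_Q and W_K are named (real checkpoints may use different names).

```python
from typing import Dict, Iterable, List

Weights = Dict[str, List[float]]  # stand-in for a real tensor state dict


def qk_restore(
    pre_sft: Weights,
    post_sft: Weights,
    qk_markers: Iterable[str] = ("q_proj", "k_proj"),  # assumed naming
) -> Weights:
    """Return a copy of post_sft whose query/key projections come from pre_sft.

    All other parameters (values, outputs, MLPs, ...) keep their
    fine-tuned values. Neither input dictionary is modified.
    """
    markers = tuple(qk_markers)
    restored = dict(post_sft)
    for name in post_sft:
        if any(marker in name for marker in markers):
            if name not in pre_sft:
                raise KeyError(f"{name} is missing from the pre-SFT checkpoint")
            restored[name] = pre_sft[name]
    return restored


# Toy checkpoints with one attention block and one MLP weight.
pre = {
    "layers.0.attn.q_proj.weight": [1.0],
    "layers.0.attn.k_proj.weight": [2.0],
    "layers.0.attn.v_proj.weight": [3.0],
    "layers.0.mlp.up.weight": [4.0],
}
post = {name: [value[0] + 10.0] for name, value in pre.items()}
merged = qk_restore(pre, post)
```

In the toy example, `merged` takes the query and key weights from `pre` and everything else from `post`. No training is involved, which is what makes the fix training-free.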
Anthropic launches Claude 3 model family: Haiku, Sonnet, and Opus
Anthropic announced the Claude 3 model family on March 4, 2024, comprising three models (Haiku, Sonnet, and Opus) in ascending order of capability. Anthropic reports that Claude 3 Opus leads on major benchmarks including MMLU, GPQA, and GSM8K, achieves near-perfect recall on long-context evaluations (200K context window, 99%+ NIAH accuracy), and adds multimodal vision capabilities. The release also highlights fewer unnecessary refusals, a twofold accuracy improvement over Claude 2.1, and Constitutional AI-based safety tuning. Opus and Sonnet launched immediately via claude.ai and the Claude API across 159 countries, with Haiku to follow.