6 · arXiv cs.AI (Artificial Intelligence) · 20h ago

HOLA adds hippocampal exact KV cache to linear attention, closing gap with full-attention Transformers

HOLA (Hippocampal Linear Attention) augments linear-attention and state-space models with a bounded, exact key-value cache inspired by Complementary Learning Systems theory. It targets the lossy compression of recurrent states, which causes earlier facts to be overwritten. The cache is managed by a residual-based eviction criterion (large beta * ||e||) rather than a learned eviction module, and it is read through a decoupled RMSNorm-gamma path for sharp retrieval. At 340M parameters, trained on 15B SlimPajama tokens, HOLA reduces WikiText perplexity from 27.32 to 22.92, which is lower than a full-attention Transformer++ baseline, and it shows strong needle-in-a-haystack recall out to 32k tokens despite training only at 2k. The work bears directly on the open question of whether linear-attention models can match full attention on long-context retrieval tasks.

Long Context Evolution WikiText-2 LAMBADA SlimPajama Complementary Learning Systems HOLA (Hippocampal Linear Attention) RULER
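The bounded cache described in the HOLA summary can be pictured with a small sketch. This is not the paper's implementation: it assumes that entries with a large beta * ||e|| are the ones kept, that the lowest-scoring entry is evicted when the cache is full, and that the residual e is supplied by the caller. The class name `BoundedExactCache` and the method `offer` are hypothetical.

```python
import heapq
import math
from itertools import count


class BoundedExactCache:
    """Fixed-capacity store of exact (key, value) pairs ranked by a write score."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._heap = []           # min-heap of (score, tiebreak, key, value)
        self._tiebreak = count()  # unique counter so ties never compare keys

    def offer(self, key, value, beta, residual):
        """Try to store a pair; return True if it ends up in the cache."""
        # Assumed score: beta * ||e||, with e the residual passed in by the caller.
        score = beta * math.sqrt(sum(x * x for x in residual))
        entry = (score, next(self._tiebreak), key, value)
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, entry)
            return True
        if score > self._heap[0][0]:
            # Cache is full: evict the lowest-scoring entry (assumed policy).
            heapq.heapreplace(self._heap, entry)
            return True
        return False  # the pair stays only in the compressed recurrent state

    def items(self):
        """Return the cached (key, value) pairs in no particular order."""
        return [(k, v) for _, _, k, v in self._heap]


cache = BoundedExactCache(capacity=2)
cache.offer("a", 1, beta=1.0, residual=[3.0, 4.0])  # score 5.0
cache.offer("b", 2, beta=1.0, residual=[0.0, 1.0])  # score 1.0
cache.offer("c", 3, beta=1.0, residual=[6.0, 8.0])  # score 10.0, replaces "b"
```

The rule needs no learned parameters, which is in line with the summary's point that no learned eviction module is used.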

Related guides (1)

Long Context Evolution · Topic guide

Long Context Evolution: From Bigger Windows to Smarter Memory

Related events (8)

6 · arXiv · cs.CL · Jun 19, 2026

HydraHead: Head-level hybridization of full and linear attention for long-context efficiency

Researchers introduce HydraHead, an architecture that hybridizes Full Attention (FA) and Linear Attention (LA) at the head level rather than the conventional layer level, motivated by interpretability findings showing functional heterogeneity among heads within the same layer. An interpretability-driven selection strategy preserves FA only for retrieval-critical heads, achieving a 7:1 LA-to-FA ratio while matching the long-context performance of a 3:1 layer-wise hybrid. Trained on only 15B tokens, HydraHead achieves over 69% improvement over the baseline at 512K context length, approaching Qwen3.5's performance despite that model having a native 256K context window. The work suggests head-level hybridization is a significantly underexplored and high-potential design axis for efficient long-context models.

Long Context Evolution Inference Economics HydraHead Qwen3

6 · arXiv · cs.CL · Jun 10, 2026

QK-Restore: Fixing long-context recall degradation caused by CoT fine-tuning in hybrid LLMs

Researchers find that chain-of-thought supervised fine-tuning (CoT-SFT) systematically degrades long-context recall in hybrid linear-attention models (HypeNet, Jet-Nemotron), with Needle-In-A-Haystack performance collapsing: HypeNet-9B, for example, drops from 67.2% to 9.4% at 256K context. The root cause is identified as CoT-SFT biasing attention gradients toward short-range patterns, corrupting the query-key projections responsible for long-range routing. The paper proposes QK-Restore, a training-free fix that restores only W_Q and W_K from the pre-SFT checkpoint, recovering long-context capability while preserving reasoning gains.

Long Context Evolution Alignment and RLHF Jet-Nemotron Needle-in-a-Haystack HypeNet +2 more
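The QK-Restore idea of copying only W_Q and W_K back from the pre-SFT checkpoint can be sketched as a merge of two parameter dictionaries. This is a minimal sketch under assumptions: checkpoints are plain name-to-tensor mappings, and the suffixes `q_proj.weight` and `k_proj.weight` are hypothetical parameter names, not taken from the paper or from HypeNet or Jet-Nemotron.

```python
def restore_qk(sft_state, pre_sft_state,
               suffixes=("q_proj.weight", "k_proj.weight")):
    """Return a copy of sft_state whose query/key projections come from pre_sft_state.

    Every other parameter keeps its fine-tuned value. The suffixes are
    illustrative names and should be adapted to the real checkpoint layout.
    """
    merged = dict(sft_state)  # shallow copy, so the inputs stay untouched
    for name, tensor in pre_sft_state.items():
        if name.endswith(suffixes) and name in merged:
            merged[name] = tensor
    return merged


sft = {"layers.0.q_proj.weight": [9.0],
       "layers.0.k_proj.weight": [9.0],
       "layers.0.v_proj.weight": [9.0]}
pre = {"layers.0.q_proj.weight": [1.0],
       "layers.0.k_proj.weight": [2.0],
       "layers.0.v_proj.weight": [3.0]}
merged = restore_qk(sft, pre)
```

In this toy example `merged` takes the query and key weights from `pre` and keeps the fine-tuned value projection from `sft`, with no further training involved.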

6 · arXiv · cs.CL · May 26, 2026

Language Models Need Sleep: Periodic Context Consolidation via Fast Weights and SSM Blocks

This paper proposes a sleep-like consolidation mechanism for transformer-based LLMs to address the quadratic scaling of attention with context length. During 'sleep' phases, the model performs N offline recurrent passes over accumulated context, updating fast weights in state-space model (SSM) blocks via a learned local rule, then clears the KV cache. The approach is evaluated on synthetic tasks (cellular automata, multi-hop graph retrieval) and math reasoning, where standard transformers and SSM-attention hybrids fail, with performance scaling with sleep duration N.

Long Context Evolution Frontier Model Releases Transformers Key-Value Cache Sleep Consolidation Mechanism +6 more

6 · arXiv · cs.AI · May 19, 2026

DashAttention: Differentiable and Adaptive Sparse Hierarchical Attention for Long-Context LLMs

DashAttention introduces a two-stage hierarchical sparse attention mechanism that replaces the fixed top-k block selection used in methods like NSA and InfLLMv2 with an adaptive α-entmax transformation, allowing a variable number of KV blocks to be selected per query. The approach keeps the full hierarchy differentiable by using the first-stage selection as a prior for second-stage softmax attention. Experiments show comparable accuracy to full attention at 75% sparsity with a better Pareto frontier than competing methods, and a Triton GPU implementation achieves meaningful speedup over FlashAttention-3 at inference time.

Training Infrastructure Long Context Evolution Triton InfLLMv2 FlashAttention-3 +4 more
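To see how an entmax-style transformation can select a variable number of KV blocks per query, the sketch below uses sparsemax, which is the α = 2 special case of α-entmax. It illustrates only the adaptive support size. The summary does not state the α value, the block scoring, or the second-stage softmax prior that DashAttention uses, so none of those are reproduced here; the function names are hypothetical.

```python
def sparsemax(scores):
    """Euclidean projection of scores onto the probability simplex (alpha = 2 entmax)."""
    z = sorted(scores, reverse=True)
    cumsum = 0.0
    tau = 0.0
    for k, zk in enumerate(z, start=1):
        cumsum += zk
        # The support is the largest k for which this inequality still holds.
        if 1 + k * zk > cumsum:
            tau = (cumsum - 1) / k
    return [max(s - tau, 0.0) for s in scores]


def selected_blocks(block_scores):
    """Indices of the KV blocks that receive non-zero weight."""
    return [i for i, p in enumerate(sparsemax(block_scores)) if p > 0]


peaked_query = selected_blocks([2.0, 1.0, 0.1])  # one dominant block
flat_query = selected_blocks([1.0, 0.9, 0.1])    # two near-tied blocks
```

A peaked score vector keeps a single block while a flatter one keeps two, whereas a fixed top-k rule would return the same count for both queries.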

6 · arXiv · cs.CL · Jun 5, 2026

CLSA: Cross-Layer Sparse Attention with Shared Routing for Efficient Long-Context Inference

Researchers propose Cross-Layer Sparse Attention (CLSA), a method that builds on KV-sharing architectures (like YOCO) to share both the KV cache and the routing index across decoder layers. A single indexer computes token-level top-k selection once and reuses it across layers, reducing routing overhead while preserving fine-grained selectivity. Experiments on short- and long-context benchmarks show up to 7.6x decoding speedup and 17.1x overall throughput improvement at 128K context, addressing pre-filling, KV-cache storage, and decoding bottlenecks simultaneously.

Long Context Evolution Inference Economics CLSA YOCO Cross-Layer Sparse Attention with Shared Routing +1 more
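The shared-routing idea in the CLSA summary, where top-k selection is computed once and reused by every layer over a shared KV cache, can be sketched in plain Python. This is a toy illustration under assumptions: the indexer scores are given as input, attention is single-head dot-product softmax, and the function names are hypothetical rather than taken from the paper.

```python
import math


def top_k_indices(index_scores, k):
    """Token positions with the k highest indexer scores, in position order."""
    order = sorted(range(len(index_scores)),
                   key=lambda i: index_scores[i], reverse=True)
    return sorted(order[:k])


def sparse_attend(query, keys, values, selected):
    """Softmax attention restricted to the selected token positions."""
    logits = [sum(q * x for q, x in zip(query, keys[i])) for i in selected]
    m = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - m) for l in logits]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * values[i][d] for w, i in zip(weights, selected)) / total
            for d in range(dim)]


def decode_step(layer_queries, keys, values, index_scores, k):
    """Route once, then let every layer attend over the same selected tokens."""
    selected = top_k_indices(index_scores, k)  # computed a single time
    outputs = [sparse_attend(q, keys, values, selected) for q in layer_queries]
    return selected, outputs


keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
values = [[0.0], [10.0], [0.0], [20.0]]
selected, outputs = decode_step(
    layer_queries=[[1.0, 0.0], [0.0, 1.0]],  # one query per layer
    keys=keys, values=values,
    index_scores=[0.1, 0.9, 0.2, 0.8], k=2,
)
```

Here the sort that produces `selected` runs once per decoding step regardless of the number of layers, which is where the routing overhead is saved in this sketch.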

7 · arXiv · cs.CL · Jun 9, 2026

Latent Context Language Models (LCLMs) achieve competitive encoder-decoder KV cache compression at scale

Researchers introduce Latent Context Language Models (LCLMs), a family of encoder-decoder compressors that map long token sequences to shorter latent embeddings consumed by a decoder, targeting the KV cache memory bottleneck in long-context inference. The authors conduct architecture search and continually pre-train 0.6B-encoder/4B-decoder models on over 350B tokens at compression ratios of 1:4, 1:8, and 1:16. LCLMs improve the Pareto frontier across general-task performance, compression speed, and peak memory, and are demonstrated as efficient backbones for long-horizon agents that can skim compressed context and expand relevant segments on demand. The work closes a previously noted gap between encoder-decoder approaches and KV cache compression methods on the accuracy-efficiency frontier.

Long Context Evolution Inference Economics End-to-End Context Compression at Scale Latent Context Language Models +1 more

5 · arXiv · cs.CL · 3d ago

FlashMorph: Learned layer selection for converting Transformers to hybrid attention models

This arXiv paper introduces FlashMorph, a method for converting standard Transformer models into hybrid attention architectures by optimally selecting which layers retain full attention versus linear attention. Rather than using heuristic placement patterns, FlashMorph frames layer selection as a budget-constrained subset optimization, jointly learning layerwise gates on synthetic long-context retrieval data with a linearization regularization term. Experiments show FlashMorph finds more effective hybrid configurations that preserve long-context recall and general benchmark performance while reducing selection cost compared to prior methods. The work addresses a practical efficiency problem in deploying long-context models at scale.

Long Context Evolution Inference Economics FlashMorph

6 · arXiv · cs.AI · May 29, 2026

VideoMLA: Low-Rank Latent KV Cache for Minute-Scale Autoregressive Video Diffusion

VideoMLA applies Multi-Head Latent Attention (MLA) to causal video diffusion, replacing per-head keys and values with a shared low-rank content latent and decoupled 3D-RoPE positional key, achieving 92.7% reduction in per-token KV memory. The paper investigates why MLA works despite pretrained video attention not being low-rank (unlike the spectral assumption motivating MLA in LLMs), finding that the MLA bottleneck itself determines effective rank rather than the pretrained spectrum. On VBench, VideoMLA matches short-horizon baselines, achieves best overall score at long horizons, and delivers 1.23x throughput improvement on a single NVIDIA B200 GPU.

Training Infrastructure Long Context Evolution NVIDIA B200 KV Cache 3D-RoPE +5 more
