6 · arXiv · cs.AI (Artificial Intelligence) · 1mo ago

DashAttention: Differentiable and Adaptive Sparse Hierarchical Attention for Long-Context LLMs

DashAttention introduces a two-stage hierarchical sparse attention mechanism that replaces the fixed top-k block selection used in methods like NSA and InfLLMv2 with an adaptive α-entmax transformation, so the number of KV blocks selected can vary per query. The approach keeps the full hierarchy differentiable by using the first-stage selection as a prior for the second-stage softmax attention. Experiments show accuracy comparable to full attention at 75% sparsity, with a better Pareto frontier than competing methods, and a Triton GPU implementation achieves a meaningful speedup over FlashAttention-3 at inference time.
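To make the contrast with fixed top-k concrete, here is a minimal sketch of adaptive block selection. It assumes the α = 2 special case of α-entmax (sparsemax), which has a simple closed form; the summary does not state which α the paper uses. The function names, the toy scores, and the way the first-stage probabilities enter the second stage (added to token logits as a log-prior) are all illustrative assumptions, not the authors' implementation.

```python
import math

def sparsemax(scores):
    """Sparsemax (alpha-entmax with alpha = 2): projects scores onto the simplex.

    Unlike softmax, the result can contain exact zeros, so the number of
    selected blocks depends on the scores rather than on a fixed k.
    """
    z = sorted(scores, reverse=True)
    cumsum, tau = 0.0, 0.0
    for j, zj in enumerate(z, start=1):
        cumsum += zj
        if 1.0 + j * zj > cumsum:      # block j is still inside the support
            tau = (cumsum - 1.0) / j   # threshold for a support of size j
    return [max(s - tau, 0.0) for s in scores]

def selected_blocks(block_probs):
    """Indices of KV blocks that received non-zero probability."""
    return [i for i, p in enumerate(block_probs) if p > 0.0]

def prior_weighted_softmax(block_probs, token_logits):
    """Second stage (assumed form): softmax over tokens of the selected blocks,
    with log(block probability) added to each token logit as a prior."""
    entries = []
    for b, p in enumerate(block_probs):
        if p > 0.0:
            for t, logit in enumerate(token_logits[b]):
                entries.append((b, t, logit + math.log(p)))
    m = max(v for _, _, v in entries)
    exps = [(b, t, math.exp(v - m)) for b, t, v in entries]
    total = sum(e for _, _, e in exps)
    return {(b, t): e / total for b, t, e in exps}

# Hypothetical query-to-block scores for two different queries.
peaked = sparsemax([3.0, 1.0, 0.1])        # one dominant block
flat = sparsemax([1.0, 0.9, 0.8, -1.0])    # three similar blocks
token_logits = [[0.1, 0.2], [0.0, 0.3], [0.5, 0.1], [9.0, 9.0]]
weights = prior_weighted_softmax(flat, token_logits)
```

In this toy example the peaked query keeps a single block while the flat query keeps three, and the fourth block is dropped from the second stage entirely even though its token logits are large.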

Related events (8)

6 · arXiv · cs.CL · 46h ago

HydraHead: Head-level hybridization of full and linear attention for long-context efficiency

Researchers introduce HydraHead, an architecture that hybridizes Full Attention (FA) and Linear Attention (LA) at the head level rather than the conventional layer level, motivated by interpretability findings showing functional heterogeneity among heads within the same layer. An interpretability-driven selection strategy preserves FA only for retrieval-critical heads, achieving a 7:1 LA-to-FA ratio while matching the long-context performance of a 3:1 layer-wise hybrid. Trained on only 15B tokens, HydraHead achieves over 69% improvement over the baseline at 512K context length, approaching Qwen3.5's performance despite that model having a native 256K context window. The work suggests head-level hybridization is a significantly underexplored and high-potential design axis for efficient long-context models.

6 · arXiv · cs.CL · 15d ago

CLSA: Cross-Layer Sparse Attention with Shared Routing for Efficient Long-Context Inference

Researchers propose Cross-Layer Sparse Attention (CLSA), a method that builds on KV-sharing architectures (like YOCO) to share both the KV cache and the routing index across decoder layers. A single indexer computes token-level top-k selection once and reuses it across layers, reducing routing overhead while preserving fine-grained selectivity. Experiments on short- and long-context benchmarks show up to 7.6x decoding speedup and 17.1x overall throughput improvement at 128K context, addressing pre-filling, KV-cache storage, and decoding bottlenecks simultaneously.
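The idea of routing once and reusing the result can be sketched in a few lines. This is a toy illustration under stated assumptions: a single shared KV cache, one hypothetical indexer query per decoding step, and plain dot-product scoring. The names `topk_indices`, `sparse_attend`, and `decode_step` are invented for this example and are not taken from the paper.

```python
import heapq
import math

def topk_indices(index_query, keys, k):
    """Indexer: score every cached token once and keep the k best positions."""
    scores = [sum(q * x for q, x in zip(index_query, key)) for key in keys]
    best = heapq.nlargest(k, range(len(keys)), key=lambda i: scores[i])
    return sorted(best)

def sparse_attend(query, keys, values, idx):
    """Softmax attention restricted to the token positions listed in idx."""
    logits = [sum(q * x for q, x in zip(query, keys[i])) for i in idx]
    m = max(logits)
    w = [math.exp(l - m) for l in logits]
    total = sum(w)
    dim = len(values[0])
    return [sum(wi * values[i][d] for wi, i in zip(w, idx)) / total
            for d in range(dim)]

def decode_step(layer_queries, index_query, keys, values, k):
    """Compute the routing index once, then reuse it in every layer."""
    idx = topk_indices(index_query, keys, k)
    outputs = [sparse_attend(q, keys, values, idx) for q in layer_queries]
    return idx, outputs

# Toy shared KV cache with four tokens and three decoder layers.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
values = [[1.0], [2.0], [3.0], [4.0]]
layer_queries = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
idx, outputs = decode_step(layer_queries, [1.0, 0.5], keys, values, k=2)
```

The scoring pass over all cached tokens happens once per step instead of once per layer, which is where the routing savings in this sketch come from; each layer still applies its own query to the selected tokens.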

5 · arXiv · cs.CL · 3d ago

ConSA: Learned FA/SWA allocation for efficient hybrid attention in LLMs

ConSA is a framework that learns optimal assignments between full attention and sliding-window attention layers under a user-specified sparsity target, using L0 regularization and augmented Lagrangian constraints. Evaluated on 0.6B and 1.7B parameter models, learned allocations consistently outperform hand-crafted rule-based baselines, with KV-head-wise granularity outperforming layer-wise. A consistent structural pattern emerges: SWA concentrates in bottom layers while FA clusters in contiguous middle-layer blocks, diverging from the evenly interleaved patterns used in existing hybrid architectures.

5 · arXiv · cs.LG · 46h ago

Multi-Task Bayesian In-Context Learning for Amortized Hierarchical Inference

A new arXiv preprint introduces a multi-task in-context learning framework for amortized hierarchical Bayesian predictive inference, representing prior information as a prefix of in-context datasets fed to a transformer. The model learns to adapt predictions across families of priors, addressing the brittleness of prior-data fitted models under distribution shift. On evaluations including out-of-meta-distribution priors and high-dimensional latent structures, the method matches oracle Bayesian predictors while being orders of magnitude faster, with a real-world spatiotemporal temperature prediction demonstration.

6 · arXiv · cs.CL · 25d ago

Language Models Need Sleep: Periodic Context Consolidation via Fast Weights and SSM Blocks

This paper proposes a sleep-like consolidation mechanism for transformer-based LLMs to address the quadratic scaling of attention with context length. During 'sleep' phases, the model performs N offline recurrent passes over accumulated context, updating fast weights in state-space model (SSM) blocks via a learned local rule, then clears the KV cache. The approach is evaluated on synthetic tasks (cellular automata, multi-hop graph retrieval) and math reasoning, settings where standard transformers and SSM-attention hybrids fail, and its performance scales with the sleep duration N.

6 · arXiv · cs.AI · 22d ago

VideoMLA: Low-Rank Latent KV Cache for Minute-Scale Autoregressive Video Diffusion

VideoMLA applies Multi-Head Latent Attention (MLA) to causal video diffusion, replacing per-head keys and values with a shared low-rank content latent and a decoupled 3D-RoPE positional key, achieving a 92.7% reduction in per-token KV memory. The paper investigates why MLA works even though pretrained video attention is not low-rank (contrary to the spectral assumption that motivates MLA in LLMs), finding that the MLA bottleneck itself determines the effective rank rather than the pretrained spectrum. On VBench, VideoMLA matches short-horizon baselines, achieves the best overall score at long horizons, and delivers a 1.23x throughput improvement on a single NVIDIA B200 GPU.

3 · Hugging Face Blog · 1mo ago

Understanding BigBird's Block Sparse Attention

This Hugging Face blog post provides a technical explanation of BigBird's block sparse attention mechanism, which extends transformer models to handle longer sequences by replacing dense quadratic attention with a combination of local, global, and random sparse attention patterns. The post covers the theoretical underpinnings and implementation details of how BigBird achieves linear complexity with respect to sequence length. It serves as educational commentary on a published research architecture that enables efficient processing of sequences of 4096 tokens or more.
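The combination of local, global, and random patterns can be illustrated with a block-level mask. This is a simplified sketch, not the Hugging Face implementation: the function name, parameter names, and default values are assumptions chosen for illustration, and real implementations gather blocks rather than materializing a full mask.

```python
import random

def bigbird_block_mask(n_blocks, window=3, n_global=1, n_random=1, seed=0):
    """Boolean block mask: mask[i][j] is True if query block i attends to key block j."""
    rng = random.Random(seed)
    half = window // 2
    mask = [[False] * n_blocks for _ in range(n_blocks)]

    # Global blocks: they attend to everything, and everything attends to them.
    for g in range(min(n_global, n_blocks)):
        for i in range(n_blocks):
            mask[g][i] = True
            mask[i][g] = True

    for i in range(n_blocks):
        # Local sliding window around the diagonal.
        for j in range(max(0, i - half), min(n_blocks, i + half + 1)):
            mask[i][j] = True
        # A few random extra blocks drawn from those not yet attended.
        candidates = [j for j in range(n_blocks) if not mask[i][j]]
        for j in rng.sample(candidates, min(n_random, len(candidates))):
            mask[i][j] = True
    return mask

mask = bigbird_block_mask(16)
blocks_per_row = [sum(row) for row in mask]
```

Every non-global row attends to at most `window + n_global + n_random` blocks regardless of `n_blocks`, so the total number of attended block pairs grows linearly with sequence length rather than quadratically.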

4 · arXiv · cs.CL · 11d ago

Attention Expansion mechanism improves keyphrase extraction from long documents without full-context LLMs

Researchers propose an 'attention expansion' mechanism that augments pre-trained language model token representations with information from out-of-context chunks using static word embeddings, enabling more effective keyphrase extraction from long documents. The approach avoids the computational cost of full-document attention or LLM-based inference while expanding the effective contextual scope of PLM-based models. Evaluated across five PLM backbones and five benchmark corpora, the method consistently improves F1 scores over state-of-the-art baselines in both scientific and news domains.