technique
Cross-Layer Sparse Attention
technique · active · provisional
cross-layer-sparse-attention-9a2a86d7 · 1 event · first seen 11d ago
Aliases: Cross-Layer Sparse Attention
More like this (12)
Cross-Layer Sparse Attention with Shared Routing · Block Sparse Attention · MiniMax Sparse Attention · sparse attention · ProbSparse Attention · DeepSeek Sparse Attention · Multi-head Latent Attention (MLA) · cross-attention · Locality-Sensitive Hashing Attention · Sliding Window Attention · Sparse Transformer · bidirectional attention
Recent events (1)
CLSA: Cross-Layer Sparse Attention with Shared Routing for Efficient Long-Context Inference
Researchers propose Cross-Layer Sparse Attention (CLSA), a method that builds on KV-sharing architectures (like YOCO) to share both the KV cache and the routing index across decoder layers. A single indexer computes token-level top-k selection once and reuses it across layers, reducing routing overhead while preserving fine-grained selectivity. Experiments on short- and long-context benchmarks show up to 7.6x decoding speedup and 17.1x overall throughput improvement at 128K context, addressing pre-filling, KV-cache storage, and decoding bottlenecks simultaneously.
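The shared-routing idea in the summary above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`top_k_indices`, `sparse_attention`, `decode_step`), the dot-product scoring used by the indexer, and the single-head, pure-Python layout are all assumptions made for illustration. The sketch shows only the structure the summary describes: one top-k selection over a shared KV cache, reused by every layer during a decoding step.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def top_k_indices(index_query, index_keys, k):
    """Hypothetical indexer: score every cached token once, keep the k best.

    Dot-product scoring is an assumption; the summary does not specify it.
    """
    scores = [dot(index_query, key) for key in index_keys]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:k])  # return the selection in positional order


def sparse_attention(query, keys, values, selected):
    """Softmax attention restricted to the selected token positions."""
    scale = 1.0 / math.sqrt(len(query))
    logits = [dot(query, keys[i]) * scale for i in selected]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(logit - peak) for logit in logits]
    total = sum(weights)
    dim = len(values[0])
    return [
        sum(w * values[i][d] for w, i in zip(weights, selected)) / total
        for d in range(dim)
    ]


def decode_step(layer_queries, index_query, shared_keys, shared_values, k):
    """One decoding step: route once, then reuse the selection in every layer."""
    selected = top_k_indices(index_query, shared_keys, k)  # computed once
    outputs = [
        sparse_attention(q, shared_keys, shared_values, selected)
        for q in layer_queries  # each layer has its own query, same selection
    ]
    return selected, outputs


if __name__ == "__main__":
    # Toy shared KV cache with four cached tokens (illustrative values only).
    shared_keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
    shared_values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
    layer_queries = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]  # three layers
    selected, outputs = decode_step(
        layer_queries, [1.0, 1.0], shared_keys, shared_values, k=2
    )
    print(selected)
    print(outputs)
```

In this sketch the routing cost is paid once per decoding step rather than once per layer, and each layer attends to only `k` cached tokens instead of the full context. Those are the two sources of savings the summary attributes to CLSA; the actual indexer design and kernel-level details are not given in the summary and are not modeled here.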