ConSA: Learned FA/SWA allocation for efficient hybrid attention in LLMs
ConSA is a framework that learns how to allocate full attention (FA) and sliding-window attention (SWA) across a model under a user-specified sparsity target, using L0 regularization and augmented Lagrangian constraints. Evaluated on 0.6B- and 1.7B-parameter models, learned allocations consistently outperform hand-crafted rule-based baselines, and KV-head-wise allocation outperforms layer-wise allocation. A consistent structural pattern emerges: SWA concentrates in the bottom layers while FA clusters in contiguous middle-layer blocks, diverging from the evenly interleaved patterns used in existing hybrid architectures.
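The summary does not give ConSA's exact formulation, so the following is only a minimal sketch of how L0-regularized gates and an augmented Lagrangian sparsity penalty are commonly combined. It assumes hard-concrete gates in the style of Louizos et al. (2018), one gate per layer (1 = FA, 0 = SWA), and a penalty of the form lam1 * gap + lam2 * gap^2. The constants, function names, and `log_alpha` values are all hypothetical and are not taken from ConSA.

```python
import numpy as np

# Hard-concrete constants commonly used with L0 regularization.
# Assumed here for illustration; not taken from ConSA.
BETA, LEFT, RIGHT = 2.0 / 3.0, -0.1, 1.1


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def sample_gates(log_alpha, rng):
    """Draw one relaxed gate in [0, 1] per layer (1 = FA, 0 = SWA)."""
    u = rng.uniform(1e-6, 1.0 - 1e-6, size=log_alpha.shape)
    s = sigmoid((np.log(u) - np.log(1.0 - u) + log_alpha) / BETA)
    # Stretch to (LEFT, RIGHT), then clip so exact 0s and 1s can occur.
    return np.clip(s * (RIGHT - LEFT) + LEFT, 0.0, 1.0)


def prob_full_attention(log_alpha):
    """Probability that each gate is non-zero (the expected-L0 term)."""
    return sigmoid(log_alpha - BETA * np.log(-LEFT / RIGHT))


def expected_swa_fraction(log_alpha):
    """Expected share of layers assigned to SWA (the 'sparsity')."""
    return 1.0 - prob_full_attention(log_alpha).mean()


def lagrangian_penalty(log_alpha, target, lam1, lam2):
    """Penalty that vanishes when expected sparsity hits the target."""
    gap = expected_swa_fraction(log_alpha) - target
    return lam1 * gap + lam2 * gap ** 2


def final_allocation(log_alpha):
    """Deterministic FA/SWA choice per layer after training."""
    return np.where(prob_full_attention(log_alpha) > 0.5, "FA", "SWA")


# Hypothetical gate parameters for an 8-layer model.
log_alpha = np.array([-4.0, -4.0, -4.0, 3.0, 3.0, 3.0, -4.0, -4.0])
rng = np.random.default_rng(0)
print(sample_gates(log_alpha, rng))
print(final_allocation(log_alpha))
print(lagrangian_penalty(log_alpha, target=0.625, lam1=0.1, lam2=1.0))
```

In a real training loop, `log_alpha` would be learned jointly with the model weights and the multipliers `lam1` and `lam2` would be updated to enforce the target. A KV-head-wise variant would hold one gate per KV head rather than one per layer.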
Related events (8)
HydraHead: Head-level hybridization of full and linear attention for long-context efficiency
Researchers introduce HydraHead, an architecture that hybridizes Full Attention (FA) and Linear Attention (LA) at the head level rather than the conventional layer level, motivated by interpretability findings showing functional heterogeneity among heads within the same layer. An interpretability-driven selection strategy preserves FA only for retrieval-critical heads, achieving a 7:1 LA-to-FA ratio while matching the long-context performance of a 3:1 layer-wise hybrid. Trained on only 15B tokens, HydraHead improves on the baseline by over 69% at 512K context length, approaching the performance of Qwen3.5 even though that model has a native 256K context window. The work suggests that head-level hybridization is an underexplored, high-potential design axis for efficient long-context models.
DashAttention: Differentiable and Adaptive Sparse Hierarchical Attention for Long-Context LLMs
DashAttention introduces a two-stage hierarchical sparse attention mechanism that replaces the fixed top-k block selection used in methods like NSA and InfLLMv2 with an adaptive α-entmax transformation, allowing a variable number of KV blocks to be selected per query. The approach keeps the full hierarchy differentiable by using the first-stage selection as a prior for second-stage softmax attention. Experiments show comparable accuracy to full attention at 75% sparsity with a better Pareto frontier than competing methods, and a Triton GPU implementation achieves meaningful speedup over FlashAttention-3 at inference time.
CLSA: Cross-Layer Sparse Attention with Shared Routing for Efficient Long-Context Inference
Researchers propose Cross-Layer Sparse Attention (CLSA), a method that builds on KV-sharing architectures (like YOCO) to share both the KV cache and the routing index across decoder layers. A single indexer computes token-level top-k selection once and reuses it across layers, reducing routing overhead while preserving fine-grained selectivity. Experiments on short- and long-context benchmarks show up to 7.6x decoding speedup and 17.1x overall throughput improvement at 128K context, addressing pre-filling, KV-cache storage, and decoding bottlenecks simultaneously.
SETA: Sparse Subspace-to-Expert Sharing for Continual Learning in LLMs
Researchers introduce SETA (Mixture of Sparse Experts for Task Agnostic Continual Learning), a framework addressing catastrophic forgetting in LLMs via adaptive sparse subspace decomposition into task-specific and shared expert modules. The approach uses adaptive elastic anchoring and routing-aware regularization to protect shared knowledge at both weight and routing levels. Experiments on LLaMA-2 7B and Qwen3-4B show competitive or superior performance versus continual learning baselines, with strong retention of early-task knowledge.
Language Models Need Sleep: Periodic Context Consolidation via Fast Weights and SSM Blocks
This paper proposes a sleep-like consolidation mechanism for transformer-based LLMs to address the quadratic scaling of attention with context length. During 'sleep' phases, the model performs N offline recurrent passes over accumulated context, updating fast weights in state-space model (SSM) blocks via a learned local rule, then clears the KV cache. The approach is evaluated on synthetic tasks (cellular automata, multi-hop graph retrieval) and math reasoning. Standard transformers and SSM-attention hybrids fail on these tasks, while the proposed model's performance scales with the sleep duration N.
Functional Attention: Reinterpreting Attention as Functional Correspondences for Operator Learning
This paper introduces Functional Attention, a novel attention mechanism for operator learning that replaces standard softmax token-wise affinities with structured linear operators inspired by geometric functional maps. The approach treats attention as a correspondence between adaptive bases rather than discrete tokens, yielding a resolution-invariant, globally-aware representation. Experiments show competitive or state-of-the-art performance on PDE solving, 3D segmentation, and regression tasks, with robustness to varying discretizations.
SAERL: Using Sparse Autoencoders to Guide LLM Reinforcement Learning Data Engineering
SAERL is a post-training data engineering framework that uses Sparse Autoencoders (SAEs), a mechanistic interpretability tool, to extract intrinsic model signals for controlling data diversity, difficulty, and quality during RL fine-tuning. The framework applies SAE-space clustering for batch diversity, a difficulty proxy for curriculum ordering, and a quality probe for data filtering. On Qwen2.5-Math-1.5B with GRPO, SAERL achieves a 3% average accuracy improvement and reaches target accuracy with 20% fewer training steps. SAE representations transfer across model families and scales, suggesting broad applicability as a lightweight data engineering tool.
SMoA: Spectrum Modulation Adapter for Parameter-Efficient Fine-Tuning
SMoA is a new parameter-efficient fine-tuning method that addresses LoRA's trade-off between rank size and parameter budget. It partitions model layers into spectral blocks and applies Hadamard-modulated low-rank branches to each diagonal block, enabling broader coverage of pretrained spectral directions without proportionally increasing trainable parameters. Theoretical analysis and empirical results on multiple tasks show SMoA outperforms LoRA and competitive LoRA-style baselines in lower-budget settings.

