5 · arXiv cs.CL (Computation and Language) · 37h ago

NLL-guided training-free method selects optimal full-attention layers for efficient long-context inference

Researchers propose NLL-guided layer selection, a training-free technique for hybrid attention models that decides which layers use full attention (FA) and which use sliding-window attention by measuring how much the negative log-likelihood (NLL) of answer tokens degrades. On LongMemEval with Qwen3-4B, the method reaches 64.6% accuracy with full attention in only 1/4 of the layers, matching a periodic 1/2-FA baseline while halving compute and outperforming a periodic 1/4-FA baseline by 10.4 percentage points. Calibration is a one-time cost of approximately 15 minutes of compute, which keeps the method practical for deployment. The work improves the efficiency-accuracy tradeoff for long-context LLM inference without any retraining.
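
The selection step can be illustrated with a minimal sketch. It assumes that calibration yields one score per layer (the increase in answer-token NLL when that layer alone is switched to sliding-window attention) and that the layers with the largest increase keep full attention. The greedy top-k ranking, the function names, and the scores are all hypothetical; the summary does not state the exact procedure.

```python
def select_full_attention_layers(nll_delta, budget):
    """Keep full attention in the `budget` layers whose switch to
    sliding-window attention hurts answer-token NLL the most.

    nll_delta[i] is a hypothetical calibration score for layer i.
    """
    ranked = sorted(range(len(nll_delta)), key=lambda i: nll_delta[i], reverse=True)
    return sorted(ranked[:budget])


def periodic_layers(num_layers, period):
    """Periodic baseline: every `period`-th layer keeps full attention."""
    return list(range(period - 1, num_layers, period))


# Made-up scores for an 8-layer model, for illustration only.
nll_delta = [0.02, 0.01, 0.30, 0.05, 0.45, 0.03, 0.28, 0.04]
budget = len(nll_delta) // 4  # 1/4 of the layers keep full attention

guided = select_full_attention_layers(nll_delta, budget)
periodic = periodic_layers(len(nll_delta), period=4)
print(guided, periodic)
```

Both placements use the same number of full-attention layers and differ only in where those layers sit. For reference, the reported 10.4-point gap implies roughly 54.2% accuracy for the periodic 1/4-FA baseline.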

Related events (8)

6 · arXiv · cs.CL · 11d ago

HydraHead: Head-level hybridization of full and linear attention for long-context efficiency

Researchers introduce HydraHead, an architecture that hybridizes Full Attention (FA) and Linear Attention (LA) at the head level rather than the conventional layer level, motivated by interpretability findings showing functional heterogeneity among heads within the same layer. An interpretability-driven selection strategy preserves FA only for retrieval-critical heads, achieving a 7:1 LA-to-FA ratio while matching the long-context performance of a 3:1 layer-wise hybrid. Trained on only 15B tokens, HydraHead achieves over 69% improvement over the baseline at 512K context length, approaching Qwen3.5's performance despite that model having a native 256K context window. The work suggests head-level hybridization is a significantly underexplored and high-potential design axis for efficient long-context models.
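
The two ratios are easier to compare as shares of full attention. The arithmetic below is a small sketch that assumes both ratios are written LA-to-FA and that every unit (head or layer) counts equally; the helper name is hypothetical.

```python
from fractions import Fraction


def fa_share(la_parts, fa_parts):
    """Share of units (heads or layers) that keep full attention."""
    return Fraction(fa_parts, la_parts + fa_parts)


head_level = fa_share(7, 1)   # 7:1 LA-to-FA, counted over heads
layer_level = fa_share(3, 1)  # 3:1 LA-to-FA, counted over layers
print(head_level, layer_level, layer_level / head_level)
```

Under these assumptions the head-level hybrid keeps full attention in 1/8 of its units against 1/4 for the layer-wise hybrid, which is half as much.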

5 · arXiv · cs.AI · 20d ago

CLP: Lightweight collocation-length predictor achieves zero-loss multi-token inference speedup

Researchers propose CLP (Collocation-Length Predictor), a span-level decision layer for accelerating LLM inference via multi-token prediction without quality degradation. The key insight is 'Backbone-as-Architect': the backbone LM head always generates the first token while MTP heads handle only subsequent tokens, eliminating head-backbone competition that causes repetitive outputs in prior methods. CLP uses a single linear layer (~4.6K–7.7K parameters) versus 1M-parameter gate networks in prior work, achieving 1.14x–1.29x speedup on Qwen2.5 models with near-zero repetition ratio. The paper also establishes that shorter prediction horizons improve MTP head accuracy on larger models, offering a scaling-aware design principle.

6 · arXiv · cs.CL · 20d ago

QK-Restore: Fixing long-context recall degradation caused by CoT fine-tuning in hybrid LLMs

Researchers find that chain-of-thought supervised fine-tuning (CoT-SFT) systematically degrades long-context recall in hybrid linear-attention models (HypeNet, Jet-Nemotron), with Needle-In-A-Haystack performance collapsing: for example, HypeNet-9B drops from 67.2% to 9.4% at 256K context. The paper identifies the root cause as CoT-SFT biasing attention gradients toward short-range patterns, which corrupts the query-key projections responsible for long-range routing. It proposes QK-Restore, a training-free fix that restores only W_Q and W_K from the pre-SFT checkpoint, recovering long-context capability while preserving reasoning gains.
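
The restore step amounts to a selective checkpoint merge, sketched below. Checkpoints are modeled as plain dictionaries from parameter name to weights, and the `q_proj.weight` / `k_proj.weight` suffixes are an assumed naming convention, not something the summary specifies; `qk_restore` is a hypothetical helper, not the paper's code.

```python
def qk_restore(sft_state, base_state, suffixes=("q_proj.weight", "k_proj.weight")):
    """Return the fine-tuned weights with only W_Q and W_K taken from
    the pre-SFT checkpoint.

    The suffixes are an assumed naming scheme for the query and key
    projections; adjust them to the model at hand.
    """
    restored = dict(sft_state)  # shallow copy, inputs stay untouched
    for name, weights in base_state.items():
        if name in restored and name.endswith(suffixes):
            restored[name] = weights
    return restored


# Toy checkpoints: lists stand in for weight tensors.
names = [
    "layers.0.self_attn.q_proj.weight",
    "layers.0.self_attn.k_proj.weight",
    "layers.0.self_attn.v_proj.weight",
    "layers.0.mlp.up_proj.weight",
]
base_state = {name: [1.0] for name in names}  # pre-SFT
sft_state = {name: [9.0] for name in names}   # after CoT-SFT

merged = qk_restore(sft_state, base_state)
for name in names:
    print(name, merged[name])
```

Only the query and key projections revert to their pre-SFT values; every other parameter, including the value projection and the MLP, keeps its fine-tuned weights.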

6 · arXiv · cs.AI · 1mo ago

DashAttention: Differentiable and Adaptive Sparse Hierarchical Attention for Long-Context LLMs

DashAttention introduces a two-stage hierarchical sparse attention mechanism that replaces the fixed top-k block selection used in methods like NSA and InfLLMv2 with an adaptive α-entmax transformation, allowing a variable number of KV blocks to be selected per query. The approach keeps the full hierarchy differentiable by using the first-stage selection as a prior for second-stage softmax attention. Experiments show comparable accuracy to full attention at 75% sparsity with a better Pareto frontier than competing methods, and a Triton GPU implementation achieves meaningful speedup over FlashAttention-3 at inference time.
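
To see how an entmax-style transformation selects a variable number of blocks, the sketch below uses sparsemax, which is the α = 2 member of the α-entmax family. It is a stand-in: the summary does not give the α value or the block-scoring function DashAttention uses, and the block scores here are made up.

```python
def sparsemax(scores):
    """Sparsemax (alpha-entmax with alpha = 2): project scores onto the
    probability simplex, assigning exactly zero to low-scoring entries."""
    z = sorted(scores, reverse=True)
    cumsum = 0.0
    tau = 0.0
    for k, zk in enumerate(z, start=1):
        cumsum += zk
        if 1.0 + k * zk > cumsum:  # entry k is still in the support
            tau = (cumsum - 1.0) / k
        else:
            break  # once the condition fails it fails for all later entries
    return [max(s - tau, 0.0) for s in scores]


def selected_blocks(scores):
    """Indices of KV blocks that receive non-zero weight."""
    return [i for i, p in enumerate(sparsemax(scores)) if p > 0.0]


peaked = [2.0, 1.0, 0.1]  # one block dominates
flat = [1.0, 0.8, 0.1]    # two blocks are close
print(selected_blocks(peaked), selected_blocks(flat))
```

Unlike top-k, the number of surviving blocks depends on the score distribution: the peaked query keeps one block while the flatter query keeps two.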

5 · arXiv · cs.CL · 13h ago

FlashMorph: Learned layer selection for converting Transformers to hybrid attention models

FlashMorph converts standard Transformer models into hybrid attention architectures by selecting which layers retain full attention and which switch to linear attention. Rather than using heuristic placement patterns, it frames layer selection as a budget-constrained subset optimization, jointly learning layerwise gates on synthetic long-context retrieval data with a linearization regularization term. Experiments show that FlashMorph finds hybrid configurations that better preserve long-context recall and general benchmark performance, at a lower selection cost than prior methods. The work addresses a practical efficiency problem in deploying long-context models at scale.

4 · arXiv · cs.CL · 20d ago

Attention Expansion mechanism improves keyphrase extraction from long documents without full-context LLMs

Researchers propose an 'attention expansion' mechanism that augments pre-trained language model token representations with information from out-of-context chunks using static word embeddings, enabling more effective keyphrase extraction from long documents. The approach avoids the computational cost of full-document attention or LLM-based inference while expanding the effective contextual scope of PLM-based models. Evaluated across five PLM backbones and five benchmark corpora, the method consistently improves F1 scores over state-of-the-art baselines in both scientific and news domains.

5 · arXiv · cs.CL · 13d ago

ConSA: Learned FA/SWA allocation for efficient hybrid attention in LLMs

ConSA is a framework that learns optimal assignments between full attention and sliding-window attention layers under a user-specified sparsity target, using L0 regularization and augmented Lagrangian constraints. Evaluated on 0.6B and 1.7B parameter models, learned allocations consistently outperform hand-crafted rule-based baselines, with KV-head-wise granularity outperforming layer-wise. A consistent structural pattern emerges: SWA concentrates in bottom layers while FA clusters in contiguous middle-layer blocks, diverging from the evenly interleaved patterns used in existing hybrid architectures.
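
For readers unfamiliar with the constraint mechanism, the textbook augmented Lagrangian for an equality constraint is shown below. This is the generic form, not necessarily ConSA's exact objective, and all symbols are assumptions introduced here: \(\mathcal{L}_{\text{task}}\) is the language-modeling loss, \(s(\theta)\) the expected sparsity implied by the L0 gates, \(s^{\ast}\) the user-specified target, \(\lambda\) the multiplier, and \(\rho > 0\) the penalty weight.

```latex
\mathcal{L}_{\rho}(\theta, \lambda)
  = \mathcal{L}_{\text{task}}(\theta)
  + \lambda \,\bigl(s(\theta) - s^{\ast}\bigr)
  + \frac{\rho}{2}\,\bigl(s(\theta) - s^{\ast}\bigr)^{2},
\qquad
\lambda \leftarrow \lambda + \rho \,\bigl(s(\theta) - s^{\ast}\bigr)
```

The quadratic term penalizes any deviation from the target sparsity, while the multiplier update accumulates the remaining constraint violation so that training settles at the target rather than merely near it.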

6 · arXiv · cs.CL · 1mo ago

Language Models Need Sleep: Periodic Context Consolidation via Fast Weights and SSM Blocks

This paper proposes a sleep-like consolidation mechanism for transformer-based LLMs to address the quadratic scaling of attention with context length. During 'sleep' phases, the model performs N offline recurrent passes over accumulated context, updating fast weights in state-space model (SSM) blocks via a learned local rule, then clears the KV cache. The approach is evaluated on synthetic tasks (cellular automata, multi-hop graph retrieval) and math reasoning, settings in which standard transformers and SSM-attention hybrids fail, and its performance scales with the sleep duration N.