5 · arXiv cs.CL (Computation and Language) · 14h ago

FlashMorph: Learned layer selection for converting Transformers to hybrid attention models

This arXiv paper introduces FlashMorph, a method for converting standard Transformer models into hybrid attention architectures by learning which layers retain full attention and which switch to linear attention. Rather than relying on heuristic placement patterns, FlashMorph frames layer selection as a budget-constrained subset optimization and jointly learns layerwise gates on synthetic long-context retrieval data, with a linearization regularization term. According to the reported experiments, FlashMorph finds hybrid configurations that better preserve long-context recall and general benchmark performance, at a lower selection cost than prior methods. The work addresses a practical efficiency problem in deploying long-context models at scale.
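
As a rough illustration of the final selection step only, the sketch below assumes each layer already has a learned scalar gate score and keeps full attention in the highest-scoring layers up to a fixed budget. The function name, the gate values, and the top-k rule are hypothetical; the summary does not describe the paper's training objective or how its gates are discretized.

```python
from typing import List


def select_full_attention_layers(gate_scores: List[float], budget: int) -> List[int]:
    """Return the indices of the `budget` highest-scoring layers, in layer order."""
    if not 0 <= budget <= len(gate_scores):
        raise ValueError("budget must be between 0 and the number of layers")
    # Rank layer indices by gate score, highest first.
    ranked = sorted(range(len(gate_scores)), key=lambda i: gate_scores[i], reverse=True)
    return sorted(ranked[:budget])


# Hypothetical gate values for an 8-layer model (illustrative, not from the paper).
gates = [0.91, 0.12, 0.35, 0.78, 0.05, 0.66, 0.20, 0.83]
full_layers = select_full_attention_layers(gates, budget=3)

# Layers outside the selected subset are converted to linear attention.
layer_types = ["full" if i in full_layers else "linear" for i in range(len(gates))]
print(full_layers)
print(layer_types)
```

The budget plays the role of the constraint in the subset optimization: however the gates are learned, at most `budget` layers keep full attention.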

Long Context Evolution · Inference Economics · FlashMorph

Related guides (2)

Long Context Evolution · Topic guide

Long Context Evolution: From Bigger Windows to Smarter Memory


Inference Economics · Topic guide

Inference Economics: The Cost of Running AI in Production


Related events (8)

6 · arXiv · cs.CL · 11d ago

HydraHead: Head-level hybridization of full and linear attention for long-context efficiency

Researchers introduce HydraHead, an architecture that hybridizes Full Attention (FA) and Linear Attention (LA) at the head level rather than the conventional layer level, motivated by interpretability findings showing functional heterogeneity among heads within the same layer. An interpretability-driven selection strategy preserves FA only for retrieval-critical heads, achieving a 7:1 LA-to-FA ratio while matching the long-context performance of a 3:1 layer-wise hybrid. Trained on only 15B tokens, HydraHead improves on the baseline by over 69% at 512K context length, approaching Qwen3.5's performance despite that model having a native 256K context window. The work suggests that head-level hybridization is an underexplored, high-potential design axis for efficient long-context models.

Long Context Evolution · Inference Economics · HydraHead · Qwen3

6 · arXiv · cs.LG · 20d ago

Express: Efficient causal attention approximation with formal guarantees and FlashAttention 2 speedups

A new tool called Express converts non-causal attention approximations into causal ones with matching theoretical guarantees, achieving log^(3/2)(n)/s approximation error with O(s) memory. Combined with the Thinformer approximation and an I/O-aware Triton implementation, it demonstrates substantial speedups over FlashAttention 2. The work targets four practical bottlenecks: long-context prefill, KV cache compression, and both memory- and compute-constrained long-form decoding.

Training Infrastructure · Long Context Evolution · Triton · Thinformer · FlashAttention 2

5 · arXiv · cs.CL · 38h ago

NLL-guided training-free method selects optimal full-attention layers for efficient long-context inference

Researchers propose NLL-guided layer selection, a training-free technique for hybrid attention models that identifies which layers should use full versus sliding-window attention by measuring negative log-likelihood degradation on answer tokens. On LongMemEval with Qwen3-4B, the method achieves 64.6% accuracy while keeping full attention in only 1/4 of the layers, matching a 1/2-FA periodic baseline while halving compute, and outperforming a periodic 1/4-FA baseline by 10.4 percentage points. The calibration procedure requires approximately 15 minutes of one-time compute, making it practical for deployment. The work advances the efficiency-accuracy tradeoff for long-context LLM inference without requiring any retraining.
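
To make the contrast with a periodic baseline concrete, the sketch below assumes a per-layer score equal to the increase in answer-token NLL when that layer alone is switched to sliding-window attention, and keeps full attention where the increase is largest. The scoring protocol, the function names, and the numbers are assumptions for illustration; the summary does not specify how the paper computes or aggregates its measurements.

```python
from typing import Dict, List


def select_by_nll_degradation(nll_increase: Dict[int, float], num_full: int) -> List[int]:
    """Keep full attention in the layers whose conversion hurts NLL the most."""
    ranked = sorted(nll_increase, key=lambda layer: nll_increase[layer], reverse=True)
    return sorted(ranked[:num_full])


def periodic_layers(num_layers: int, every: int) -> List[int]:
    """Periodic baseline: full attention in every `every`-th layer."""
    return list(range(every - 1, num_layers, every))


# Hypothetical calibration results for an 8-layer model (not from the paper):
# layer index -> increase in answer-token NLL after converting that layer.
nll_increase = {0: 0.02, 1: 0.40, 2: 0.01, 3: 0.03, 4: 0.55, 5: 0.04, 6: 0.02, 7: 0.05}

num_full = len(nll_increase) // 4  # a 1/4 full-attention budget
guided = select_by_nll_degradation(nll_increase, num_full)
periodic = periodic_layers(len(nll_increase), every=4)

print("NLL-guided:", guided)
print("periodic:  ", periodic)
```

Both selections use the same number of full-attention layers; only the placement differs, which is the variable the reported comparison isolates.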

Long Context Evolution · Inference Economics · LongMemEval · Qwen3-4B · NLL-Guided Full-Attention Layer Selection for Training-Free Sliding-Window Adaptation

4 · Hugging Face Blog · 1mo ago

Nyströmformer: Approximating Self-Attention in Linear Time and Memory via the Nyström Method

This Hugging Face blog post covers Nyströmformer, a transformer variant that approximates standard self-attention using the Nyström method to achieve linear time and memory complexity. The approach addresses the quadratic scaling bottleneck of standard attention, enabling processing of longer sequences at reduced computational cost. The post likely covers the model's integration into the Hugging Face ecosystem and its practical use cases.

Long Context Evolution · Inference Economics · Nyströmformer · Nyström method · Hugging Face

6 · arXiv · cs.AI · 1mo ago

DashAttention: Differentiable and Adaptive Sparse Hierarchical Attention for Long-Context LLMs

DashAttention introduces a two-stage hierarchical sparse attention mechanism that replaces the fixed top-k block selection used in methods like NSA and InfLLMv2 with an adaptive α-entmax transformation, allowing a variable number of KV blocks to be selected per query. The approach keeps the full hierarchy differentiable by using the first-stage selection as a prior for second-stage softmax attention. Experiments show comparable accuracy to full attention at 75% sparsity with a better Pareto frontier than competing methods, and a Triton GPU implementation achieves meaningful speedup over FlashAttention-3 at inference time.
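
The difference between fixed top-k and an adaptive sparse transformation can be seen with sparsemax, which is the α = 2 special case of α-entmax. The sketch below is a plain-Python sparsemax over hypothetical block scores; it is not DashAttention's implementation, and the value of α used in the paper is not given in this summary.

```python
from typing import List


def sparsemax(scores: List[float]) -> List[float]:
    """Project scores onto the probability simplex; low scores get exactly zero."""
    z = sorted(scores, reverse=True)
    cumsum = 0.0
    tau = 0.0
    for k, zk in enumerate(z, start=1):
        cumsum += zk
        # The k-th largest score stays in the support while this condition holds.
        if 1.0 + k * zk > cumsum:
            tau = (cumsum - 1.0) / k
        else:
            break
    return [max(s - tau, 0.0) for s in scores]


# Hypothetical per-block scores for two different queries.
peaked = sparsemax([2.0, 1.0, 0.1])  # one block clearly dominates
flat = sparsemax([1.0, 0.9, 0.1])    # two blocks are close

print(peaked, sum(1 for p in peaked if p > 0))
print(flat, sum(1 for p in flat if p > 0))
```

The number of nonzero entries depends on the scores themselves, so one query can keep a single block while another keeps several, whereas top-k always keeps exactly k.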

Training Infrastructure · Long Context Evolution · Triton · InfLLMv2 · FlashAttention-3

6 · arXiv · cs.CL · 27d ago

Dynamic short convolutions yield 1.33–1.60× compute advantage over standard Transformers

A new arXiv preprint introduces dynamic short convolutions as an architectural primitive for Transformers, using input-dependent filters to combine locality bias with increased expressivity. Experiments across 150M–2B parameter language models show consistent perplexity improvements over standard Transformers and static convolution variants, with scaling-law fits indicating a 1.33× compute advantage when applied to key/query/value vectors and 1.60× when added after every linear layer. The technique also improves linear RNNs (Mamba-2, Gated DeltaNet) and mixture-of-experts architectures, with custom Triton kernels making training practical.

Training Infrastructure · Frontier Model Releases · Triton · Mamba · Gated DeltaNet-2

4 · Hugging Face Blog · 1mo ago

Improving Hugging Face Training Efficiency Through Packing with Flash Attention 2

Hugging Face published a blog post describing a technique for improving training efficiency by packing multiple short sequences into a single batch using Flash Attention 2. The approach reduces padding waste and improves GPU utilization during LLM fine-tuning. This is a practical infrastructure optimization relevant to practitioners training models on datasets with variable-length sequences.
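
A minimal sketch of the bookkeeping behind padding-free packing, assuming the common layout in which sequences are concatenated and the attention kernel is told where each one starts and ends through cumulative sequence lengths. The helper names and token ids are hypothetical, and this is not the code from the blog post.

```python
from typing import List, Tuple


def pack_sequences(seqs: List[List[int]]) -> Tuple[List[int], List[int], List[int]]:
    """Concatenate sequences; return token ids, per-sequence positions, and boundaries."""
    input_ids: List[int] = []
    position_ids: List[int] = []
    cu_seqlens: List[int] = [0]  # cumulative lengths mark sequence boundaries
    for seq in seqs:
        input_ids.extend(seq)
        position_ids.extend(range(len(seq)))  # positions restart at 0 per sequence
        cu_seqlens.append(cu_seqlens[-1] + len(seq))
    return input_ids, position_ids, cu_seqlens


def padding_fraction(seqs: List[List[int]]) -> float:
    """Fraction of a padded batch that would be padding tokens."""
    longest = max(len(s) for s in seqs)
    total = longest * len(seqs)
    real = sum(len(s) for s in seqs)
    return (total - real) / total


# Hypothetical token ids for three short examples.
batch = [[11, 12, 13], [21], [31, 32]]
input_ids, position_ids, cu_seqlens = pack_sequences(batch)

print(input_ids)
print(position_ids)
print(cu_seqlens)
print(padding_fraction(batch))
```

The boundaries keep tokens from attending across examples, while the padded layout for the same batch would spend a third of its slots on padding.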

Training Infrastructure · Inference Economics · Hugging Face · Flash Attention 2 · sequence packing

5 · Hugging Face Blog · 5d ago

AllenAI analysis: which tokens do hybrid models predict better than pure transformers?

A Hugging Face blog post from AllenAI investigates the token-level prediction differences between hybrid models (combining attention and state-space or other mechanisms) and standard transformer architectures. The analysis aims to characterize where hybrid architectures gain or lose predictive advantage at the token level. This kind of mechanistic comparison is relevant to ongoing debates about when hybrid designs are worth their added complexity.

Frontier Model Releases · Evaluation and Benchmarking · Hugging Face · Allen Institute for AI