Nyströmformer: Approximating Self-Attention in Linear Time and Memory via the Nyström Method
This Hugging Face blog post covers Nyströmformer, a transformer variant that approximates standard self-attention using the Nyström method to achieve linear time and memory complexity. The approach addresses the quadratic scaling bottleneck of standard attention, enabling processing of longer sequences at reduced computational cost. The post likely covers the model's integration into the Hugging Face ecosystem and its practical use cases.
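As a rough sketch of the approximation this summary refers to, using the notation of the Nyströmformer paper rather than anything stated in the summary itself: assume queries Q and keys K with n rows of dimension d, and m landmark rows (written with tildes, with m much smaller than n) obtained in the paper by averaging contiguous segments of Q and K. The n-by-n softmax attention matrix is then approximated by a product of three smaller matrices:

```latex
\operatorname{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}}\right)
\;\approx\;
\underbrace{\operatorname{softmax}\!\left(\frac{Q\tilde{K}^{\top}}{\sqrt{d}}\right)}_{n \times m}
\;
\underbrace{\left(\operatorname{softmax}\!\left(\frac{\tilde{Q}\tilde{K}^{\top}}{\sqrt{d}}\right)\right)^{+}}_{m \times m}
\;
\underbrace{\operatorname{softmax}\!\left(\frac{\tilde{Q}K^{\top}}{\sqrt{d}}\right)}_{m \times n}
```

Here the superscript + denotes the Moore-Penrose pseudoinverse. Multiplying the three factors into the value matrix from right to left never materializes an n-by-n matrix, so for a fixed number of landmarks m the cost grows linearly in n.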
Related events (8)
The Reformer - Pushing the limits of language modeling
This Hugging Face blog post covers the Reformer, a memory-efficient transformer architecture that uses locality-sensitive hashing (LSH) attention and reversible residual layers to handle very long sequences. The post explains the technical mechanisms that allow Reformer to process sequences of up to 1 million tokens with a significantly smaller memory footprint than standard transformers. It serves as an educational deep-dive into the architectural innovations introduced in the original Reformer paper by Kitaev et al.
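The bucketing idea behind LSH attention can be illustrated with a minimal sketch. It assumes the angular hashing scheme described in the Reformer paper (project with a random matrix, concatenate the projections with their negations, take the argmax); the helper names make_projections, lsh_bucket, and bucket_positions are hypothetical and are not taken from the blog post or from any library.

```python
import random


def make_projections(dim, n_buckets, seed=0):
    """Draw n_buckets // 2 random Gaussian directions (hypothetical helper)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_buckets // 2)]


def lsh_bucket(vec, projections):
    """Angular LSH: bucket = argmax over [xR ; -xR]."""
    scores = [sum(v * p for v, p in zip(vec, proj)) for proj in projections]
    scores = scores + [-s for s in scores]
    return max(range(len(scores)), key=scores.__getitem__)


def bucket_positions(vectors, projections):
    """Group sequence positions by bucket id."""
    buckets = {}
    for pos, vec in enumerate(vectors):
        buckets.setdefault(lsh_bucket(vec, projections), []).append(pos)
    return buckets
```

Vectors that point in similar directions tend to land in the same bucket, so attention can be restricted to positions that share a bucket instead of comparing every pair. A full implementation would also sort and chunk the buckets and use several hash rounds, which this sketch leaves out.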
Express: Efficient causal attention approximation with formal guarantees and FlashAttention 2 speedups
A new tool called Express converts non-causal attention approximations into causal ones with matching theoretical guarantees, achieving log^(3/2)(n)/s approximation error with O(s) memory. Combined with the Thinformer approximation and an I/O-aware Triton implementation, it demonstrates substantial speedups over FlashAttention 2. The work targets four practical bottlenecks: long-context prefill, KV cache compression, and both memory- and compute-constrained long-form decoding.
Understanding BigBird's Block Sparse Attention
This Hugging Face blog post provides a technical explanation of BigBird's block sparse attention mechanism, which extends transformer models to longer sequences by replacing dense quadratic attention with a combination of local, global, and random sparse attention patterns. The post covers the theoretical underpinnings and implementation details of how BigBird achieves linear complexity with respect to sequence length. It serves as educational commentary on a published research architecture that enables efficient processing of sequences of 4096 tokens or more.
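The combination of local, global, and random patterns can be sketched as a block-level boolean mask. This is a simplified illustration under stated assumptions, not the library's implementation: the function name block_sparse_mask and its parameters are hypothetical, and only the first n_global blocks are treated as global.

```python
import random


def block_sparse_mask(n_blocks, window=1, n_global=1, n_random=1, seed=0):
    """Return mask[i][j] == True when query block i may attend to key block j."""
    rng = random.Random(seed)
    mask = [[False] * n_blocks for _ in range(n_blocks)]
    n_global = min(n_global, n_blocks)
    for i in range(n_blocks):
        # Local pattern: each block sees itself and `window` neighbours per side.
        for j in range(max(0, i - window), min(n_blocks, i + window + 1)):
            mask[i][j] = True
        # Global pattern (columns): every block attends to the global blocks.
        for g in range(n_global):
            mask[i][g] = True
    # Global pattern (rows): the global blocks attend to every block.
    for g in range(n_global):
        mask[g] = [True] * n_blocks
    # Random pattern: a few extra key blocks per query block.
    for i in range(n_blocks):
        candidates = [j for j in range(n_blocks) if not mask[i][j]]
        for j in rng.sample(candidates, min(n_random, len(candidates))):
            mask[i][j] = True
    return mask
```

Each non-global row keeps at most 2 * window + 1 + n_global + n_random key blocks regardless of n_blocks, which is where the linear scaling in sequence length comes from.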
Improving Hugging Face Training Efficiency Through Packing with Flash Attention 2
Hugging Face published a blog post describing a technique for improving training efficiency by packing multiple short sequences into a single batch using Flash Attention 2. The approach reduces padding waste and improves GPU utilization during LLM fine-tuning. This is a practical infrastructure optimization relevant to practitioners training models on datasets with variable-length sequences.
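A minimal sketch of padding-free packing, assuming the common formulation in which examples are concatenated and position ids restart at each example boundary. The helper names pack_sequences and padding_tokens are hypothetical, and the cu_seqlens field is an illustrative way to record cumulative boundaries, not a claim about what the blog post's code returns.

```python
def pack_sequences(sequences):
    """Concatenate variable-length sequences without padding (hypothetical helper)."""
    input_ids, position_ids, cu_seqlens = [], [], [0]
    for seq in sequences:
        input_ids.extend(seq)
        # Positions restart at 0 so each example keeps its own positional context.
        position_ids.extend(range(len(seq)))
        # Cumulative boundaries tell the attention kernel where examples end.
        cu_seqlens.append(cu_seqlens[-1] + len(seq))
    return {
        "input_ids": input_ids,
        "position_ids": position_ids,
        "cu_seqlens": cu_seqlens,
    }


def padding_tokens(sequences):
    """Count the pad tokens a conventional padded batch would need."""
    longest = max(len(seq) for seq in sequences)
    return sum(longest - len(seq) for seq in sequences)
```

For three examples of lengths 3, 2, and 1, a padded batch would carry 3 pad tokens, while the packed form carries none and relies on the boundaries to keep attention from crossing between examples.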
Probabilistic Time Series Forecasting with Transformers
This Hugging Face blog post introduces probabilistic time series forecasting using Transformer-based models available in the Hugging Face ecosystem. It covers the application of attention-based architectures to sequential prediction tasks with uncertainty quantification. The post serves as a tutorial and capability demonstration for time series modeling within the Transformers library.
A Failed Experiment: Infini-Attention, and Why We Should Keep Trying?
A Hugging Face blog post documents an attempt to implement and validate Infini-Attention, a technique proposed to extend transformer context length by combining local attention with a compressed global memory. The experiment reportedly failed to reproduce the claimed benefits, raising questions about the reproducibility and practical viability of the approach. The post frames the failure as instructive and argues for continued experimentation with long-context architectures.
Differential Transformer V2
Microsoft has published a blog post on Hugging Face introducing Differential Transformer V2, an updated version of their differential attention mechanism for transformers. The differential attention architecture aims to reduce attention noise by computing attention as a difference between two softmax attention maps. This post likely covers improvements to the original design, training dynamics, or scaling behavior of the V2 iteration.
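The "difference between two softmax attention maps" can be sketched in a few lines. This illustration assumes the formulation of the original Differential Transformer, with the weighting factor lam held as a fixed scalar for simplicity (it is learned in the original design); it makes no claim about what V2 changes, and the function names are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_map(queries, keys):
    """Scaled dot-product attention weights, one row per query."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    return [
        softmax([scale * sum(a * b for a, b in zip(q, k)) for k in keys])
        for q in queries
    ]


def differential_attention(q1, k1, q2, k2, values, lam=0.5):
    """Apply (softmax(q1 k1^T) - lam * softmax(q2 k2^T)) to the values."""
    a1 = attention_map(q1, k1)
    a2 = attention_map(q2, k2)
    out = []
    for r1, r2 in zip(a1, a2):
        weights = [w1 - lam * w2 for w1, w2 in zip(r1, r2)]
        out.append([
            sum(w * v[c] for w, v in zip(weights, values))
            for c in range(len(values[0]))
        ])
    return out
```

Attention mass that both maps assign to the same positions is partially cancelled by the subtraction, which is the intuition behind the noise-reduction claim. With lam set to 0 the function reduces to ordinary softmax attention.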
Language Models Need Sleep: Periodic Context Consolidation via Fast Weights and SSM Blocks
This paper proposes a sleep-like consolidation mechanism for transformer-based LLMs to address the quadratic scaling of attention with context length. During 'sleep' phases, the model performs N offline recurrent passes over accumulated context, updating fast weights in state-space model (SSM) blocks via a learned local rule, then clears the KV cache. The approach is evaluated on synthetic tasks (cellular automata, multi-hop graph retrieval) and math reasoning, where standard transformers and SSM-attention hybrids fail, and performance scales with the sleep duration N.


