Almanac
product

FlashAttention 2

product · active · provisional · flashattention-2-66322352 · 1 event · first seen 7d ago

Aliases: FlashAttention 2


Recent events (1)

6 · arXiv · cs.LG · 7d ago

Express: Efficient causal attention approximation with formal guarantees and FlashAttention 2 speedups

A new tool called Express converts non-causal attention approximations into causal ones with matching theoretical guarantees, achieving log^(3/2)(n)/s approximation error with O(s) memory. Combined with the Thinformer approximation and an I/O-aware Triton implementation, it demonstrates substantial speedups over FlashAttention 2. The work targets four practical bottlenecks: long-context prefill, KV cache compression, and both memory- and compute-constrained long-form decoding.
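
To get a feel for the error/memory trade-off quoted above, the short sketch below evaluates the leading term log^(3/2)(n)/s for a few sequence lengths n and memory budgets s. It is only an illustration under stated assumptions: the natural logarithm is assumed, all constant factors are dropped, and the function name `error_rate` and the sample values of n and s are hypothetical, not taken from the paper.

```python
import math


def error_rate(n: int, s: int) -> float:
    """Leading term log^(3/2)(n) / s of the quoted error bound.

    Assumptions (not from the source): natural log, constants omitted.
    """
    if n < 2 or s < 1:
        raise ValueError("need n >= 2 and s >= 1")
    return math.log(n) ** 1.5 / s


# Hypothetical sequence lengths and memory budgets, chosen for illustration.
for n in (4096, 65536, 1048576):
    for s in (64, 256, 1024):
        print(f"n={n:>8}  s={s:>5}  log^(3/2)(n)/s = {error_rate(n, s):.4f}")
```

Under these assumptions, doubling s halves the term, while growing n from 4096 to 1048576 (a 256x increase) raises it by only about 2.2x, which is why a modest memory budget can remain useful at long context lengths.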