Entity · technique

FlashAttention-3

technique · active · flashattention-3-02bdfde4 · 3 events · first seen May 19, 2026

Aliases: FlashAttention-3, FlashAttention, FlashAttention2

Co-occurring entities

Qwen3 · Value-aware Stochastic KV Cache Eviction · Mistral AI · MT-Bench · Mistral 7B Instruct v0.2 · CodeLlama 7B · Sliding Window Attention · Llama 2 · Mistral 7B · CoreWeave · MMLU · HuggingFace · Grouped-Query Attention · vLLM · Triton · InfLLMv2 · NSA · DashAttention · α-entmax

More like this (12)

Flash Attention 2 · FlashAttention 2 · DFlash · DashAttention · Gemini 3 Flash · AdaFlash · ElevenLabs Flash v2.5 · Gemini 3.5 Flash · Lightning Attention · GLM-4.7-Flash · Gemini 3.1 Flash Live · FlashMorph

Recent events (3)

6 · arXiv · cs.CL · Jun 3, 2026

VaSE: Value-Aware Stochastic KV Cache Eviction improves reasoning model efficiency

A new arXiv preprint introduces Value-aware Stochastic KV Cache Eviction (VaSE), a training-free method for compressing KV caches in long-chain-of-thought reasoning models. The authors identify two key failure modes in prior eviction approaches — catastrophic repetition loops caused by evicting high-magnitude value states, and low cache diversity — and address both with targeted protections and stochastic eviction. On six reasoning tasks with Qwen3 models at 4x compression, VaSE outperforms the current best selection-based sparse attention method and exceeds the strongest eviction baseline by over 4%, while supporting FlashAttention2 and maintaining a static memory footprint.

Frontier Model Releases · Inference Economics · FlashAttention-3 · Qwen3 · Value-aware Stochastic KV Cache Eviction

8 · Mistral AI News · Jun 1, 2026

Mistral 7B: Open-Weights 7B Model Outperforming Llama 2 13B

Mistral AI released Mistral 7B, a 7.3B-parameter language model under the Apache 2.0 license. It outperforms Llama 2 13B across all evaluated benchmarks, outperforms Llama 1 34B on many of them, and approaches CodeLlama 7B on code tasks. The model uses Grouped-Query Attention (GQA) for faster inference and Sliding Window Attention (SWA) to handle longer sequences at reduced cost, yielding a roughly 2x speed improvement at a 16k sequence length. A fine-tuned chat variant, Mistral 7B Instruct, outperforms all 7B chat models on MT-Bench and is competitive with 13B-class chat models. The release includes deployment support for AWS, GCP, Azure, HuggingFace, and local use via vLLM.

Long Context Evolution · Frontier Model Releases · Mistral AI · MT-Bench · Mistral 7B Instruct v0.2 · +13 more
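
The Sliding Window Attention idea mentioned in the summary above can be illustrated with a small mask. This is a minimal sketch, not Mistral's implementation: the function names `sliding_window_mask` and `attended_pairs` are hypothetical, and the window convention (each token sees itself and the `window - 1` tokens before it) is an assumption made for the example.

```python
def sliding_window_mask(seq_len, window):
    """Return mask[i][j] == True when query i may attend to key j.

    Assumed convention: causal attention restricted to the most recent
    `window` positions, including the query position itself.
    """
    return [
        [(j <= i) and (i - j < window) for j in range(seq_len)]
        for i in range(seq_len)
    ]


def attended_pairs(mask):
    """Count the (query, key) pairs the mask allows."""
    return sum(sum(row) for row in mask)


if __name__ == "__main__":
    mask = sliding_window_mask(seq_len=6, window=3)
    for row in mask:
        # 'x' marks an allowed key position, '.' a masked one.
        print("".join("x" if allowed else "." for allowed in row))
    print("allowed pairs:", attended_pairs(mask))
```

With a fixed window, the number of allowed pairs grows linearly with sequence length rather than quadratically, which is the source of the cost reduction on long sequences.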

6 · arXiv · cs.AI · May 19, 2026

DashAttention: Differentiable and Adaptive Sparse Hierarchical Attention for Long-Context LLMs

DashAttention introduces a two-stage hierarchical sparse attention mechanism that replaces the fixed top-k block selection used in methods like NSA and InfLLMv2 with an adaptive α-entmax transformation, allowing a variable number of KV blocks to be selected per query. The approach keeps the full hierarchy differentiable by using the first-stage selection as a prior for second-stage softmax attention. Experiments show comparable accuracy to full attention at 75% sparsity with a better Pareto frontier than competing methods, and a Triton GPU implementation achieves meaningful speedup over FlashAttention-3 at inference time.

Training Infrastructure · Long Context Evolution · Triton · InfLLMv2 · FlashAttention-3 · +4 more
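
To make the contrast with fixed top-k selection concrete, the sketch below uses sparsemax, which is the α = 2 case of α-entmax. It is an illustration only: the summary does not state which α DashAttention uses or how block scores are computed, and the names `sparsemax` and `selected_blocks` are hypothetical. The point is that the number of blocks with nonzero weight depends on the scores instead of being fixed in advance.

```python
def sparsemax(scores):
    """Euclidean projection of scores onto the probability simplex.

    This is alpha-entmax with alpha = 2. Assumes a non-empty list of floats.
    """
    ordered = sorted(scores, reverse=True)
    cumulative = 0.0
    support_size = 0
    support_sum = 0.0
    for k, value in enumerate(ordered, start=1):
        cumulative += value
        # The k largest scores stay in the support while this holds.
        if 1.0 + k * value > cumulative:
            support_size = k
            support_sum = cumulative
    threshold = (support_sum - 1.0) / support_size
    return [max(s - threshold, 0.0) for s in scores]


def selected_blocks(scores):
    """Indices of blocks that receive nonzero weight."""
    return [i for i, p in enumerate(sparsemax(scores)) if p > 0.0]


if __name__ == "__main__":
    # One dominant block, two close blocks, and four tied blocks.
    for scores in ([2.0, 1.0, 0.1], [1.0, 0.9, 0.1], [0.5, 0.5, 0.5, 0.5]):
        print(scores, "->", selected_blocks(scores))
```

A sharply peaked score vector keeps a single block, while flatter scores keep more, so the selection adapts per query without a hand-set k.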