Entity · technique

InfLLMv2

technique · active · infllmv2-3b9e93bd · 1 event · first seen May 19, 2026

Aliases: InfLLMv2

Co-occurring entities

Triton · FlashAttention-3 · NSA · DashAttention · α-entmax

More like this (12)

LiteLLM · IFLLM · vLLM · vllm-project · whichllm · SmolVLM2 · SmolLM2 · LLM CLI · 3LM · EvalLLM · LLM.int8 · LLM (CLI tool)

Recent events (1)

6 · arXiv · cs.AI · May 19, 2026

DashAttention: Differentiable and Adaptive Sparse Hierarchical Attention for Long-Context LLMs

DashAttention introduces a two-stage hierarchical sparse attention mechanism that replaces the fixed top-k block selection used in methods like NSA and InfLLMv2 with an adaptive α-entmax transformation, allowing a variable number of KV blocks to be selected per query. The approach keeps the full hierarchy differentiable by using the first-stage selection as a prior for second-stage softmax attention. Experiments show comparable accuracy to full attention at 75% sparsity with a better Pareto frontier than competing methods, and a Triton GPU implementation achieves meaningful speedup over FlashAttention-3 at inference time.
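The contrast between fixed top-k block selection and adaptive α-entmax selection can be illustrated with a minimal pure-Python sketch. It is not the DashAttention, NSA, or InfLLMv2 implementation: the function names (`entmax_bisect`, `topk_blocks`, `entmax_blocks`), the choice α = 1.5, and the example block scores are all assumptions made for illustration. The α-entmax mapping is computed here with the standard bisection on its threshold τ, where p_i = max((α-1)·s_i - τ, 0)^(1/(α-1)) and the p_i sum to 1.

```python
def entmax_bisect(scores, alpha=1.5, n_iter=60):
    """alpha-entmax over a list of scores, via bisection on the threshold tau.

    Assumes alpha > 1. Unlike softmax, the result can contain exact zeros.
    """
    if alpha <= 1.0:
        raise ValueError("alpha must be > 1")
    z = [(alpha - 1.0) * s for s in scores]
    expo = 1.0 / (alpha - 1.0)

    def mass(tau):
        # Total probability mass for a candidate threshold (decreasing in tau).
        return sum(max(zi - tau, 0.0) ** expo for zi in z)

    # mass(max(z) - 1) >= 1 and mass(max(z)) == 0, so tau lies in between.
    lo, hi = max(z) - 1.0, max(z)
    for _ in range(n_iter):
        mid = 0.5 * (lo + hi)
        if mass(mid) >= 1.0:
            lo = mid
        else:
            hi = mid
    p = [max(zi - lo, 0.0) ** expo for zi in z]
    total = sum(p)
    return [pi / total for pi in p]


def topk_blocks(scores, k):
    """Fixed-budget selection: always returns exactly k block indices."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:k])


def entmax_blocks(scores, alpha=1.5):
    """Adaptive selection: keep every block with nonzero entmax probability."""
    p = entmax_bisect(scores, alpha)
    return [i for i, pi in enumerate(p) if pi > 0.0], p


# Hypothetical query-to-block relevance scores for 8 KV blocks.
peaked = [4.0, 0.1, 0.0, -0.2, 0.1, 0.0, -0.1, 0.2]   # one dominant block
flat = [1.0, 0.9, 1.1, 1.0, 0.95, 1.05, 1.0, 0.9]     # all blocks similar

peaked_sel, peaked_p = entmax_blocks(peaked)
flat_sel, flat_p = entmax_blocks(flat)

print("top-k (k=4):", topk_blocks(peaked, 4), topk_blocks(flat, 4))
print("entmax support sizes:", len(peaked_sel), len(flat_sel))
```

In this sketch, top-k returns four blocks for both queries regardless of how the scores are distributed, while the entmax support shrinks for the peaked query and grows for the flat one. Because the entmax output is a proper probability vector, it could in principle serve as a weighting for a later attention stage, though how DashAttention does this is not shown here.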

Training Infrastructure Long Context Evolution Triton InfLLMv2 FlashAttention-3 +4 more