Entity · product

Triton

product · active · triton-369b30e8 · 5 events · first seen May 19, 2026

Aliases: Triton

Co-occurring entities

CUDA · Thinformer · FlashAttention 2 · Express · Mamba · Gated DeltaNet-2 · Dynamic Short Convolutions Improve Transformers · Python · OpenAI · Hugging Face · Hugging Face Kernel Hub · InfLLMv2 · FlashAttention-3 · NSA · DashAttention · α-entmax

More like this (12)

Neptune · Hyperion · Apollo · Mistral Nemo · Poseidon · DINO · AlphaTensor · Merlini · Hermes · Surfer-H · TruLens · Megatron-LM

Recent events (5)

6 · arXiv · cs.LG · Jun 10, 2026

Express: Efficient causal attention approximation with formal guarantees and FlashAttention 2 speedups

A new tool called Express converts non-causal attention approximations into causal ones with matching theoretical guarantees, achieving log^(3/2)(n)/s approximation error with O(s) memory. Combined with the Thinformer approximation and an I/O-aware Triton implementation, it demonstrates substantial speedups over FlashAttention 2. The work targets four practical bottlenecks: long-context prefill, KV cache compression, and both memory- and compute-constrained long-form decoding.

Training Infrastructure · Long Context Evolution · Triton · Thinformer · FlashAttention 2 · +2 more
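The error bound quoted in the Express summary can be made concrete with a little arithmetic. The sketch below only evaluates log^(3/2)(n)/s. It assumes that n is the sequence length, that s is the memory budget, that the logarithm is natural, and that the constant factor is 1; the summary states none of these details, and the function name is hypothetical.

```python
import math


def express_error_bound(n: int, s: int) -> float:
    """Evaluate log^(3/2)(n) / s with the constant factor assumed to be 1.

    Assumption: n is the sequence length and s is the memory budget.
    """
    if n < 2 or s < 1:
        raise ValueError("expected n >= 2 and s >= 1")
    return math.log(n) ** 1.5 / s


if __name__ == "__main__":
    # Compare how the bound reacts to more memory versus longer context.
    for n in (2**12, 2**20):
        for s in (64, 128):
            print(f"n={n:>8}  s={s:>4}  bound~{express_error_bound(n, s):.4f}")
```

Under these assumptions, doubling s halves the bound, while growing n from 2^12 to 2^20 raises it by a factor of only about 2.15, since (20/12)^(3/2) ≈ 2.15 regardless of the logarithm base.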

6 · arXiv · cs.CL · Jun 3, 2026

Dynamic short convolutions yield 1.33–1.60× compute advantage over standard Transformers

A new arXiv preprint introduces dynamic short convolutions as an architectural primitive for Transformers, using input-dependent filters to combine locality bias with increased expressivity. Experiments across 150M–2B parameter language models show consistent perplexity improvements over standard Transformers and static convolution variants, with scaling-law fits indicating a 1.33× compute advantage when applied to key/query/value vectors and 1.60× when added after every linear layer. The technique also improves linear RNNs (Mamba-2, Gated DeltaNet) and mixture-of-experts architectures, with custom Triton kernels making training practical.

Training Infrastructure · Frontier Model Releases · Triton · Mamba · Gated DeltaNet-2 · +1 more
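To illustrate what "input-dependent filters" means in the summary above, here is a minimal pure-Python sketch of a causal short convolution over a scalar sequence. The filter at each position is produced by a hypothetical per-tap linear map (`tap_weight[k] * x[t] + tap_bias[k]`); the preprint's actual parameterization is not given in the summary, so this form and every name in the code are assumptions. Setting `tap_weight` to zeros recovers an ordinary static convolution.

```python
from typing import List, Sequence


def dynamic_short_conv(
    x: Sequence[float],
    tap_weight: Sequence[float],
    tap_bias: Sequence[float],
) -> List[float]:
    """Causal short convolution whose filter depends on the current input.

    Assumed (hypothetical) filter generator: w[t][k] = tap_weight[k] * x[t] + tap_bias[k].
    Output: y[t] = sum_k w[t][k] * x[t - k], skipping taps that reach before t = 0.
    """
    if len(tap_weight) != len(tap_bias):
        raise ValueError("tap_weight and tap_bias must have the same length")
    width = len(tap_bias)
    y = []
    for t, x_t in enumerate(x):
        acc = 0.0
        for k in range(width):
            if t - k < 0:  # causal: never look at positions before the start
                break
            w_tk = tap_weight[k] * x_t + tap_bias[k]  # filter tap depends on x[t]
            acc += w_tk * x[t - k]
        y.append(acc)
    return y


if __name__ == "__main__":
    seq = [1.0, 2.0, 3.0]
    # Static case: zero tap_weight gives a fixed two-tap averaging filter.
    print(dynamic_short_conv(seq, [0.0, 0.0, 0.0], [0.5, 0.5, 0.0]))
    # Dynamic case: the first tap equals x[t], so y[t] = x[t] ** 2.
    print(dynamic_short_conv(seq, [1.0, 0.0, 0.0], [0.0, 0.0, 0.0]))
```

The short window keeps the locality bias of a convolution, while making the taps a function of the input is what adds expressivity over a static filter.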

7 · OpenAI Blog · May 20, 2026

Introducing Triton: Open-source GPU programming for neural networks

OpenAI released Triton 1.0, an open-source Python-like language for GPU programming targeting neural network workloads. It enables researchers without CUDA expertise to write highly efficient GPU kernels, reportedly matching expert-level performance in most cases. The release lowers the barrier to custom GPU kernel development for ML practitioners.

Training Infrastructure · Inference Economics · Triton · Python · OpenAI · +2 more
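For readers unfamiliar with the programming model behind the release described above, the following pure-Python sketch emulates how a block-based kernel splits an element-wise add across program instances. It is not Triton code and needs no GPU; the function names are hypothetical. In an actual Triton kernel the body would be a `@triton.jit` function that obtains its index with `tl.program_id`, builds offsets with `tl.arange`, and uses masked `tl.load` and `tl.store` calls in place of the explicit bounds check here.

```python
from typing import List, Sequence


def add_program(pid: int, x: Sequence[float], y: Sequence[float],
                out: List[float], n: int, block_size: int) -> None:
    """Emulate one program instance: it owns one contiguous block of offsets."""
    offsets = [pid * block_size + i for i in range(block_size)]
    for off in offsets:
        if off < n:  # plays the role of the load/store mask on the last block
            out[off] = x[off] + y[off]


def launch_add(x: Sequence[float], y: Sequence[float], block_size: int = 4) -> List[float]:
    """Emulate the launch grid: one program per block, ceil(n / block_size) in total."""
    if len(x) != len(y):
        raise ValueError("inputs must have the same length")
    n = len(x)
    out = [0.0] * n
    grid = (n + block_size - 1) // block_size  # ceiling division
    for pid in range(grid):  # on a GPU these programs would run in parallel
        add_program(pid, x, y, out, n, block_size)
    return out


if __name__ == "__main__":
    # Five elements with block_size=4 need two programs; the second is mostly masked.
    print(launch_add([1, 2, 3, 4, 5], [10, 20, 30, 40, 50], block_size=4))
```

The point of the block abstraction is that the author reasons about whole tiles of data per program rather than individual threads, which is a large part of what lowers the barrier relative to hand-written CUDA.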

5 · Hugging Face Blog · May 19, 2026

Hugging Face Launches Kernel Hub for Custom GPU Kernels

Hugging Face has introduced the Kernel Hub, a centralized repository for sharing and discovering custom GPU kernels optimized for AI/ML workloads. The platform aims to make high-performance custom CUDA and Triton kernels more accessible to the broader ML community. It adds an infrastructure layer to the Hugging Face ecosystem, complementing the existing model and dataset hubs.

Training Infrastructure · Inference Economics · Triton · Hugging Face · Hugging Face Kernel Hub · +2 more

6 · arXiv · cs.AI · May 19, 2026

DashAttention: Differentiable and Adaptive Sparse Hierarchical Attention for Long-Context LLMs

DashAttention introduces a two-stage hierarchical sparse attention mechanism that replaces the fixed top-k block selection used in methods like NSA and InfLLMv2 with an adaptive α-entmax transformation, allowing a variable number of KV blocks to be selected per query. The approach keeps the full hierarchy differentiable by using the first-stage selection as a prior for second-stage softmax attention. Experiments show accuracy comparable to full attention at 75% sparsity, with a better Pareto frontier than competing methods, and a Triton GPU implementation achieves a meaningful speedup over FlashAttention-3 at inference time.

Training Infrastructure · Long Context Evolution · Triton · InfLLMv2 · FlashAttention-3 · +4 more
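The contrast between fixed top-k selection and α-entmax in the DashAttention summary can be seen in a small sketch. The code below computes α-entmax by bisection on the threshold τ in p_i = max((α-1)·z_i - τ, 0)^(1/(α-1)), the standard definition of the transformation for α > 1, and then treats every block with nonzero weight as selected. How DashAttention scores blocks and feeds the result into its second stage is not described in the summary, so `select_blocks` and the example scores are purely illustrative assumptions.

```python
from typing import List, Sequence


def entmax_bisect(scores: Sequence[float], alpha: float = 1.5, iters: int = 60) -> List[float]:
    """alpha-entmax via bisection on the threshold tau (requires alpha > 1)."""
    if alpha <= 1.0:
        raise ValueError("this sketch requires alpha > 1")
    exponent = 1.0 / (alpha - 1.0)
    z = [(alpha - 1.0) * s for s in scores]
    z_max = max(z)

    def total(tau: float) -> float:
        return sum(max(v - tau, 0.0) ** exponent for v in z)

    # total(z_max - 1) >= 1 and total(z_max) == 0, and total decreases in tau,
    # so the root of total(tau) == 1 lies in [lo, hi].
    lo, hi = z_max - 1.0, z_max
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if total(mid) >= 1.0:
            lo = mid
        else:
            hi = mid
    probs = [max(v - lo, 0.0) ** exponent for v in z]
    norm = sum(probs)  # within rounding error of 1; renormalize to be safe
    return [p / norm for p in probs]


def select_blocks(scores: Sequence[float], alpha: float = 1.5) -> List[int]:
    """Hypothetical selection rule: keep every block with nonzero entmax weight."""
    return [i for i, p in enumerate(entmax_bisect(scores, alpha)) if p > 0.0]


if __name__ == "__main__":
    print(select_blocks([3.0, 2.9, 0.1, -1.0]))  # peaked scores: few blocks survive
    print(select_blocks([1.0, 0.9, 0.8, 0.7]))   # flat scores: more blocks survive
```

Unlike top-k, the number of surviving blocks is not fixed in advance: it shrinks when a few scores dominate and grows when scores are close together, which is the adaptivity the summary refers to.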