Almanac
product

Triton

productactivetriton-369b30e8·5 events·first seen 28d ago

Aliases: Triton

Co-occurring entities

More like this (12)

Recent events (5)

7Openai Blog·28d ago·source ↗

Introducing Triton: Open-source GPU programming for neural networks

OpenAI released Triton 1.0, an open-source Python-like language for GPU programming targeting neural network workloads. It enables researchers without CUDA expertise to write highly efficient GPU kernels, reportedly matching expert-level performance in most cases. The release lowers the barrier to custom GPU kernel development for ML practitioners.

5Hugging Face Blog·28d ago·source ↗

Hugging Face Launches Kernel Hub for Custom GPU Kernels

Hugging Face has introduced the Kernel Hub, a centralized repository for sharing and discovering custom GPU kernels optimized for AI/ML workloads. The platform aims to make high-performance custom CUDA and Triton kernels more accessible to the broader ML community. This represents an infrastructure layer addition to the Hugging Face ecosystem, complementing its existing model and dataset hubs.

6arXiv · cs.AI·28d ago·source ↗

DashAttention: Differentiable and Adaptive Sparse Hierarchical Attention for Long-Context LLMs

DashAttention introduces a two-stage hierarchical sparse attention mechanism that replaces the fixed top-k block selection used in methods like NSA and InfLLMv2 with an adaptive α-entmax transformation, allowing a variable number of KV blocks to be selected per query. The approach keeps the full hierarchy differentiable by using the first-stage selection as a prior for second-stage softmax attention. Experiments show comparable accuracy to full attention at 75% sparsity with a better Pareto frontier than competing methods, and a Triton GPU implementation achieves meaningful speedup over FlashAttention-3 at inference time.

6arXiv · cs.CL·13d ago·source ↗

Dynamic short convolutions yield 1.33–1.60× compute advantage over standard Transformers

A new arXiv preprint introduces dynamic short convolutions as an architectural primitive for Transformers, using input-dependent filters to combine locality bias with increased expressivity. Experiments across 150M–2B parameter language models show consistent perplexity improvements over standard Transformers and static convolution variants, with scaling-law fits indicating a 1.33× compute advantage when applied to key/query/value vectors and 1.60× when added after every linear layer. The technique also improves linear RNNs (Mamba-2, Gated DeltaNet) and mixture-of-experts architectures, with custom Triton kernels making training practical.

6arXiv · cs.LG·7d ago·source ↗

Express: Efficient causal attention approximation with formal guarantees and FlashAttention 2 speedups

A new tool called Express converts non-causal attention approximations into causal ones with matching theoretical guarantees, achieving log^(3/2)(n)/s approximation error with O(s) memory. Combined with the Thinformer approximation and an I/O-aware Triton implementation, it demonstrates substantial speedups over FlashAttention 2. The work targets four practical bottlenecks: long-context prefill, KV cache compression, and both memory- and compute-constrained long-form decoding.