OpenAI Releases Block-Sparse GPU Kernels for Sparse Neural Networks
OpenAI released optimized GPU kernels targeting block-sparse neural network architectures, claiming orders-of-magnitude speedups over cuBLAS and cuSPARSE depending on sparsity level. The kernels were applied to achieve state-of-the-art results in text sentiment analysis and generative modeling of text and images. This release represents an early infrastructure contribution toward efficient sparse computation in deep learning.
Related events (8)
Block Sparse Matrices for Smaller and Faster Language Models
This Hugging Face blog post introduces block sparse matrix techniques as a method to reduce the size and improve the inference speed of language models. Block sparsity enforces structured zero patterns in weight matrices, enabling hardware-friendly sparse operations compared to unstructured sparsity. The post likely covers implementation details and benchmarks showing efficiency gains for transformer-based models.
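As a rough illustration of why structured zero patterns are hardware-friendly, the sketch below stores only the nonzero blocks of a weight matrix and skips every absent block during a matrix-vector product. It is a plain-Python sketch under stated assumptions, not the code from the blog post or from any library: the function name `block_sparse_matvec`, the `layout` grid, and the `blocks` dictionary are all hypothetical.

```python
# Illustrative sketch only: the names and storage layout here are hypothetical
# and are not taken from the blog post or from any library.

def block_sparse_matvec(blocks, layout, block_size, x):
    """Multiply a block-sparse matrix by the vector x.

    layout[i][j] is 1 if block (i, j) is stored and 0 if it is all zeros.
    blocks maps (i, j) to a dense block_size x block_size list of rows.
    """
    n_block_rows = len(layout)
    y = [0.0] * (n_block_rows * block_size)
    for i, layout_row in enumerate(layout):
        for j, present in enumerate(layout_row):
            if not present:
                continue  # the whole block is zero, so no work is done for it
            block = blocks[(i, j)]
            for r in range(block_size):
                acc = 0.0
                for c in range(block_size):
                    acc += block[r][c] * x[j * block_size + c]
                y[i * block_size + r] += acc
    return y


# A 4x4 matrix made of 2x2 blocks, with only the diagonal blocks stored.
layout = [[1, 0],
          [0, 1]]
blocks = {
    (0, 0): [[1.0, 2.0], [3.0, 4.0]],
    (1, 1): [[5.0, 6.0], [7.0, 8.0]],
}
x = [1.0, 1.0, 1.0, 1.0]
y = block_sparse_matvec(blocks, layout, 2, x)
print(y)  # [3.0, 7.0, 11.0, 15.0]
```

Because zeros are dropped a whole block at a time, each stored block remains a small dense tile. That regularity is what a GPU kernel can exploit and what unstructured sparsity lacks.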
We Got Claude to Build CUDA Kernels and Teach Open Models
A Hugging Face blog post describes using Claude to generate CUDA kernels and then distilling that knowledge into open-weight models. The approach combines LLM-assisted low-level GPU programming with knowledge transfer to smaller open models. This sits at the intersection of AI-assisted systems programming and open-weights capability improvement.
Generative modeling with sparse transformers
OpenAI introduced the Sparse Transformer, a deep neural network using a modified sparse attention mechanism to model sequences up to 30x longer than previously feasible with standard transformers. The approach set new benchmarks on text, image, and audio generation tasks. The key algorithmic contribution is factorized sparse attention patterns that reduce the cost of full self-attention from O(n^2) to O(n√n) in the sequence length n.
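To make the idea of a factorized sparse attention pattern concrete, here is a simplified sketch of a strided causal pattern. Each position attends to a short local window plus every `stride`-th earlier position. The function name, the exact window width, and the merging of both sub-patterns into one set are simplifying assumptions for illustration, not OpenAI's implementation.

```python
# Simplified sketch of a strided causal attention pattern. The function name,
# window width, and the merging of both sub-patterns into one set are
# illustrative assumptions, not the published implementation.

def strided_pattern(n, stride):
    """For each query position i, return the sorted key positions it may attend to."""
    allowed = []
    for i in range(n):
        # Local part: the most recent `stride` positions, including i itself.
        local = set(range(max(0, i - stride + 1), i + 1))
        # Strided part: earlier positions a multiple of `stride` steps back.
        strided = {j for j in range(i + 1) if (i - j) % stride == 0}
        allowed.append(sorted(local | strided))
    return allowed


n, stride = 16, 4  # choosing stride near sqrt(n) keeps each row short
pattern = strided_pattern(n, stride)
sparse_pairs = sum(len(row) for row in pattern)
dense_pairs = n * (n + 1) // 2  # pairs used by full causal attention

print(pattern[15])                # [3, 7, 11, 12, 13, 14, 15]
print(sparse_pairs < dense_pairs)  # True
```

With the stride set near the square root of the sequence length, each position attends to on the order of √n keys instead of n, which is where the savings over full attention come from.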
Introducing Triton: Open-source GPU programming for neural networks
OpenAI released Triton 1.0, an open-source Python-like language for GPU programming targeting neural network workloads. It enables researchers without CUDA expertise to write highly efficient GPU kernels, reportedly matching expert-level performance in most cases. The release lowers the barrier to custom GPU kernel development for ML practitioners.
Custom CUDA Kernels for All from Codex and Claude
A Hugging Face blog post describes using AI coding agents (Codex and Claude) to automatically generate custom CUDA kernels, lowering the barrier to GPU kernel development. The piece demonstrates agent-assisted GPU programming as a practical workflow for ML practitioners. This represents a concrete application of AI coding tools to the specialized domain of CUDA/GPU optimization.
OpenAI and Broadcom Announce Strategic Collaboration to Deploy 10 GW of OpenAI-Designed AI Accelerators
OpenAI and Broadcom have announced a multi-year strategic partnership targeting deployment of 10 gigawatts of OpenAI-designed AI accelerators by 2029. The collaboration involves co-developing next-generation AI accelerator systems and Ethernet networking solutions aimed at scalable, energy-efficient AI infrastructure. This represents OpenAI's continued push into custom silicon, reducing dependence on third-party chip suppliers like NVIDIA.
Nvidia releases Nemotron 3 Super 120B-A12B open-weights model with hybrid Mamba-2/MoE architecture
Nvidia released Nemotron 3 Super 120B-A12B, an open-weights LLM with a hybrid Mamba-2/transformer/MoE architecture that activates only 12B parameters per token and supports up to 1 million token context. Nvidia claims the fastest inference speed in the model's size class at 442 tokens/second, and the model leads open-weights models on PinchBench agentic task evaluation, outperforming larger models including Kimi K2.5 (1T parameters). Nvidia is releasing weights, training data, and recipes under a permissive commercial license, and plans a $26B five-year investment in open-weights models. The investment is framed partly as a strategic response to Chinese labs building capable open-weights models on non-Nvidia hardware.
OpenAI partners with Cerebras for 750MW of high-speed AI compute
OpenAI has announced a partnership with Cerebras Systems to add 750MW of AI compute capacity. The collaboration is aimed at reducing inference latency and improving response speeds for ChatGPT and other real-time AI workloads. Cerebras is known for its wafer-scale chip architecture optimized for fast inference.


