From Zero to GPU: A Guide to Building and Scaling Production-Ready CUDA Kernels
Hugging Face published a guide on building and scaling production-ready CUDA kernels, covering the full workflow from development to deployment. The post targets ML engineers who need to write custom GPU kernels for inference optimization and production workloads. It addresses practical concerns around kernel compilation, testing, and integration with existing ML frameworks.
Related events (8)
Custom CUDA Kernels for All from Codex and Claude
A Hugging Face blog post describes using AI coding agents (Codex and Claude) to automatically generate custom CUDA kernels, lowering the barrier to GPU kernel development. The piece demonstrates agent-assisted GPU programming as a practical workflow for ML practitioners. This represents a concrete application of AI coding tools to the specialized domain of CUDA/GPU optimization.
Easily Build and Share ROCm Kernels with Hugging Face
Hugging Face has published a guide and tooling for building and sharing custom ROCm kernels on its platform, targeting AMD GPU users in the ML ecosystem. The post covers the workflow for packaging, uploading, and reusing ROCm-based GPGPU kernels via the Hub. This lowers the barrier for AMD GPU kernel development and sharing, complementing the existing CUDA-centric kernel ecosystem. The initiative is relevant to inference optimization and the broader push to diversify GPU hardware support in AI workloads.
Creating Custom Kernels for the AMD MI300
A Hugging Face blog post details the process of writing custom GPU kernels targeting the AMD MI300 accelerator. The post covers practical techniques for optimizing AI workloads on AMD hardware, contributing to the growing ecosystem of non-NVIDIA GPU support for ML inference and training. This is relevant to the broader trend of diversifying AI infrastructure beyond CUDA-dominant workflows.
Hugging Face Launches Kernel Hub for Custom GPU Kernels
Hugging Face has introduced the Kernel Hub, a centralized repository for sharing and discovering custom GPU kernels optimized for AI/ML workloads. The platform aims to make high-performance custom CUDA and Triton kernels more accessible to the broader ML community. It adds an infrastructure layer to the Hugging Face ecosystem alongside the existing model and dataset hubs.
Accelerate ND-Parallel: A Guide to Efficient Multi-GPU Training
Hugging Face published a guide on N-dimensional parallelism for multi-GPU training using the Accelerate library. The post covers combining data parallelism, tensor parallelism, pipeline parallelism, and other strategies to efficiently scale model training across GPU clusters. This is a practical technical resource aimed at practitioners working with large-scale distributed training setups.
We Got Claude to Build CUDA Kernels and Teach Open Models
A Hugging Face blog post describes using Claude to generate CUDA kernels and then distilling that knowledge into open-weight models. The approach combines LLM-assisted low-level GPU programming with knowledge transfer to smaller open models. It sits at the intersection of AI-assisted systems programming and improving the capabilities of open-weight models.
Twelve practical tips for designing AI-driven HPC workflows
An arXiv preprint offers twelve practical guidelines for researchers designing AI and foundation-model-driven workflows on HPC clusters. The guide addresses system-level challenges including containerisation, job arrays, feedback loop mechanics, and I/O optimisation for small files. It targets the transition from deterministic linear pipelines to adaptive, probabilistic computational environments, with particular emphasis on computational biology use cases.
Make your ZeroGPU Spaces go brrr with ahead-of-time compilation
Hugging Face introduces ahead-of-time (AOT) compilation support for ZeroGPU Spaces, reducing cold-start and inference latency by pre-compiling model kernels before deployment. The post explains how AOT compilation cuts the JIT compilation overhead that typically occurs on first inference in ZeroGPU's shared GPU environment. This is a practical infrastructure improvement for developers hosting models on Hugging Face Spaces.


