4Hugging Face Blog·1mo ago

From Zero to GPU: A Guide to Building and Scaling Production-Ready CUDA Kernels

Hugging Face published a guide on building and scaling production-ready CUDA kernels, covering the full workflow from development to deployment. The post targets ML engineers who need to write custom GPU kernels for inference optimization and production workloads. It addresses practical concerns around kernel compilation, testing, and integration with existing ML frameworks.

Training Infrastructure Inference Economics kernel-builder Hugging Face CUDA

Related guides (3)

Hugging Face

Hugging Face: The Home of Open-Source AI

Training Infrastructure · Topic guide

Training Infrastructure: The Compute Arms Race Powering Modern AI

Inference Economics · Topic guide

Inference Economics: The Cost of Running AI in Production

Related events (8)

5Hugging Face Blog·1mo ago

Custom CUDA Kernels for All from Codex and Claude

A Hugging Face blog post describes using AI coding agents (Codex and Claude) to automatically generate custom CUDA kernels, lowering the barrier to GPU kernel development. The piece demonstrates agent-assisted GPU programming as a practical workflow for ML practitioners. This represents a concrete application of AI coding tools to the specialized domain of CUDA/GPU optimization.

Training Infrastructure Inference Economics Claude Hugging Face OpenAI +4 more

4Hugging Face Blog·1mo ago

Easily Build and Share ROCm Kernels with Hugging Face

Hugging Face has published a guide and tooling for building and sharing custom ROCm kernels on its platform, targeting AMD GPU users in the ML ecosystem. The post covers the workflow for packaging, uploading, and reusing ROCm-based GPGPU kernels via the Hub. This lowers the barrier for AMD GPU kernel development and sharing, complementing the existing CUDA-centric kernel ecosystem. The initiative is relevant to inference optimization and the broader push to diversify GPU hardware support in AI workloads.

Training Infrastructure Inference Economics ROCm Hugging Face AMD +1 more

5Hugging Face Blog·1mo ago

Creating Custom Kernels for the AMD MI300

A Hugging Face blog post details the process of writing custom GPU kernels targeting the AMD MI300 accelerator. The post covers practical techniques for optimizing AI workloads on AMD hardware, contributing to the growing ecosystem of non-NVIDIA GPU support for ML inference and training. This is relevant to the broader trend of diversifying AI infrastructure beyond CUDA-dominant workflows.

Training Infrastructure Inference Economics ROCm Hugging Face AMD MI300 +1 more

5Hugging Face Blog·1mo ago

Hugging Face Launches Kernel Hub for Custom GPU Kernels

Hugging Face has introduced the Kernel Hub, a centralized repository for sharing and discovering custom GPU kernels optimized for AI/ML workloads. The platform aims to make high-performance custom CUDA and Triton kernels more accessible to the broader ML community. It adds a new infrastructure layer to the Hugging Face ecosystem, complementing the existing model and dataset hubs.

Training Infrastructure Inference Economics Triton Hugging Face Hugging Face Kernel Hub +2 more

5Hugging Face Blog·1mo ago

Accelerate ND-Parallel: A Guide to Efficient Multi-GPU Training

Hugging Face published a guide on N-dimensional parallelism for multi-GPU training using the Accelerate library. The post covers combining data parallelism, tensor parallelism, pipeline parallelism, and other strategies to efficiently scale model training across GPU clusters. This is a practical technical resource aimed at practitioners working with large-scale distributed training setups.

Training Infrastructure Agent and Tool Ecosystem N-Dimensional Parallelism tensor parallelism pipeline parallelism +3 more

5Hugging Face Blog·1mo ago

We Got Claude to Build CUDA Kernels and Teach Open Models

A Hugging Face blog post describes using Claude to generate CUDA kernels and then distilling that knowledge into open-weight models. The approach combines LLM-assisted low-level GPU programming with knowledge transfer to smaller open models. This sits at the intersection of AI-assisted systems programming and open-weights capability improvement.

Training Infrastructure Open Weights Progress Claude Hugging Face CUDA +2 more

3arXiv · cs.AI·12d ago

Twelve practical tips for designing AI-driven HPC workflows

An arXiv preprint offers twelve practical guidelines for researchers designing AI and foundation-model-driven workflows on HPC clusters. The guide addresses system-level challenges including containerisation, job arrays, feedback loop mechanics, and I/O optimisation for small files. It targets the transition from deterministic linear pipelines to adaptive, probabilistic computational environments, with particular emphasis on computational biology use cases.

Training Infrastructure Enterprise Deployment Patterns Twelve quick tips for designing AI-driven HPC workflows

4Hugging Face Blog·1mo ago

Make your ZeroGPU Spaces go brrr with ahead-of-time compilation

Hugging Face introduces ahead-of-time (AOT) compilation support for ZeroGPU Spaces, which shortens cold starts and inference times by pre-compiling model kernels before deployment. The post explains how AOT compilation reduces the JIT compilation overhead that typically occurs on first inference in ZeroGPU's shared GPU environment. This is a practical infrastructure improvement for developers hosting models on Hugging Face Spaces.

Inference Economics Enterprise Deployment Patterns ZeroGPU Hugging Face Spaces Ahead-of-Time Compilation +1 more