Make your ZeroGPU Spaces go brrr with ahead-of-time compilation
Hugging Face introduces ahead-of-time (AOT) compilation support for ZeroGPU Spaces, enabling faster cold starts and faster inference by pre-compiling model kernels before deployment. The post explains how AOT compilation reduces the JIT compilation overhead that typically occurs on first inference in ZeroGPU's shared GPU environment. This is a practical infrastructure improvement for developers hosting models on Hugging Face Spaces.
Related events (8)
Accelerating over 130,000 Hugging Face Models with ONNX Runtime
Hugging Face and Microsoft have integrated ONNX Runtime (ORT) to accelerate inference for over 130,000 models on the Hugging Face Hub. The integration enables optimized deployment across CPU and GPU hardware without requiring users to manually export or configure ONNX models. This represents a significant expansion of ORT's reach within the open-weights model ecosystem, lowering the barrier to production-grade inference optimization.
AMD + Hugging Face: Large Language Models Out-of-the-Box Acceleration with AMD GPU
Hugging Face and AMD announced integration work enabling out-of-the-box LLM acceleration on AMD GPUs via the Optimum library. The collaboration targets ROCm-based AMD hardware, aiming to reduce friction for users running inference on non-NVIDIA GPU stacks. This represents a continued push to broaden the hardware ecosystem available to open-weights model users.
Hugging Face and AMD Partner to Accelerate Models on CPU and GPU Platforms
Hugging Face and AMD announced a partnership aimed at optimizing and accelerating state-of-the-art AI models across AMD's CPU and GPU hardware platforms. The collaboration targets improved performance for models hosted and distributed through Hugging Face's ecosystem. This represents a strategic move to broaden hardware support beyond NVIDIA-dominated infrastructure in the AI/ML deployment landscape.
How Hugging Face Accelerate Runs Very Large Models Thanks to PyTorch
This Hugging Face blog post explains the technical mechanisms behind the Accelerate library for running large models that exceed single-GPU memory, leveraging PyTorch features such as device maps, CPU/disk offloading, and sharded checkpoints. It describes how models can be distributed across multiple GPUs, CPU RAM, and disk storage transparently. The post serves as both documentation and a technical explainer for practitioners working with large-scale inference and deployment.
Fit More and Train Faster With ZeRO via DeepSpeed and FairScale
This Hugging Face blog post from January 2021 covers the integration of ZeRO (Zero Redundancy Optimizer) memory optimization techniques into the Transformers training ecosystem via DeepSpeed and FairScale. ZeRO partitions optimizer states, gradients, and model parameters across GPUs to enable training of much larger models on the same hardware. The post serves as a practical guide for practitioners looking to scale model training without additional infrastructure investment.
Bringing Serverless GPU Inference to Hugging Face Users via Cloudflare Workers AI
Hugging Face and Cloudflare have partnered to bring serverless GPU inference to Hugging Face users through Cloudflare Workers AI. The integration allows developers to run Hugging Face models on Cloudflare's global edge network without managing GPU infrastructure. This represents an expansion of serverless inference options for the Hugging Face ecosystem, lowering the barrier to deploying ML models at scale.
From Zero to GPU: A Guide to Building and Scaling Production-Ready CUDA Kernels
Hugging Face published a guide on building and scaling production-ready CUDA kernels, covering the full workflow from development to deployment. The post targets ML engineers who need to write custom GPU kernels for inference optimization and production workloads. It addresses practical concerns around kernel compilation, testing, and integration with existing ML frameworks.
Hugging Face Launches Kernel Hub for Custom GPU Kernels
Hugging Face has introduced the Kernel Hub, a centralized repository for sharing and discovering custom GPU kernels optimized for AI/ML workloads. The platform aims to make high-performance custom CUDA and Triton kernels more accessible to the broader ML community. It adds an infrastructure layer to the Hugging Face ecosystem, complementing its existing model and dataset hubs.


