5·Hugging Face Blog·1mo ago

Accelerating over 130,000 Hugging Face Models with ONNX Runtime

Hugging Face and Microsoft have integrated ONNX Runtime (ORT) to accelerate inference for over 130,000 models on the Hugging Face Hub. The integration enables optimized deployment across CPU and GPU hardware without requiring users to manually export or configure ONNX models. This represents a significant expansion of ORT's reach within the open-weights model ecosystem, lowering the barrier to production-grade inference optimization.

Open Weights Progress Inference Economics Enterprise Deployment Patterns Optimum Microsoft ONNX Hugging Face

Related guides (4)

Hugging Face

Hugging Face: The Home of Open-Source AI

Microsoft

Microsoft: The AI Infrastructure Giant Betting on Every Horse

Open Weights ProgressTopic guide

Open Weights Progress: How Freely Available AI Models Caught Up to the Frontier

Enterprise Deployment PatternsTopic guide

Enterprise Deployment Patterns: From LLM Demo to Production Reality

Related events (8)

4·Hugging Face Blog·1mo ago

Optimum + ONNX Runtime: Faster Training for Hugging Face Models

Hugging Face's Optimum library integrates with Microsoft's ONNX Runtime Training to accelerate fine-tuning of transformer models. The integration aims to reduce training time and memory usage with minimal code changes for practitioners using the Hugging Face ecosystem. This tooling update targets enterprise and research users looking to optimize training efficiency on existing hardware.

Training Infrastructure Agent and Tool Ecosystem Optimum Microsoft ONNX +1 more

5·Hugging Face Blog·1mo ago

Hugging Face and AMD Partner to Accelerate Models on CPU and GPU Platforms

Hugging Face and AMD announced a partnership aimed at optimizing and accelerating state-of-the-art AI models across AMD's CPU and GPU hardware platforms. The collaboration targets improved performance for models hosted and distributed through Hugging Face's ecosystem. This represents a strategic move to broaden hardware support beyond NVIDIA-dominated infrastructure in the AI/ML deployment landscape.

Training Infrastructure Inference Economics Hugging Face AMD +1 more

5·Hugging Face Blog·1mo ago

AMD + Hugging Face: Large Language Models Out-of-the-Box Acceleration with AMD GPU

Hugging Face and AMD announced integration work enabling out-of-the-box LLM acceleration on AMD GPUs via the Optimum library. The collaboration targets ROCm-based AMD hardware, aiming to reduce friction for users running inference on non-NVIDIA GPU stacks. This represents a continued push to broaden the hardware ecosystem available to open-weights model users.

Training Infrastructure Open Weights Progress Optimum ROCm Hugging Face +2 more

5·Hugging Face Blog·1mo ago

Accelerate a World of LLMs on Hugging Face with NVIDIA NIM

NVIDIA NIM microservices are being integrated with Hugging Face to enable optimized inference deployment for a broad range of LLMs hosted on the Hub. The partnership allows developers to deploy Hugging Face models via NIM's containerized inference stack, leveraging NVIDIA's TensorRT-LLM and other optimizations. This expands the ecosystem of models accessible through NIM beyond NVIDIA's own catalog to the wider Hugging Face model repository.

Inference Economics Enterprise Deployment Patterns NVIDIA NIM NVIDIA TensorRT-LLM +2 more

4·Hugging Face Blog·1mo ago

Accelerated Inference with Optimum and Transformers Pipelines

Hugging Face announced an integration between the Optimum library and the Transformers Pipelines API, enabling accelerated inference with minimal code changes. The integration targets optimized inference backends such as ONNX Runtime, allowing users to swap in a faster inference engine transparently. This lowers the barrier to production-grade inference optimization for practitioners using the Hugging Face ecosystem.

Inference Economics Agent and Tool Ecosystem Optimum ONNX Transformers Pipelines +1 more

4·Hugging Face Blog·1mo ago

Accelerate your models with Optimum Intel and OpenVINO

Hugging Face's Optimum Intel library integrates with Intel's OpenVINO toolkit to accelerate inference of transformer models on Intel hardware. The post covers how to export models to the OpenVINO IR format and run optimized inference pipelines. This targets deployment efficiency for NLP and vision models on Intel CPUs and other Intel hardware.

Inference Economics Enterprise Deployment Patterns Hugging Face Intel OpenVINO +1 more

4·Hugging Face Blog·1mo ago

Accelerating Hugging Face Transformers with AWS Inferentia2

Hugging Face published a blog post detailing how to accelerate Transformer model inference using AWS Inferentia2, Amazon's second-generation ML inference chip. The post covers integration patterns between the Hugging Face ecosystem and the Neuron SDK for deploying models on Inferentia2 hardware. This represents a practical guide for enterprise and cloud-based inference deployment using dedicated AI accelerators.

Training Infrastructure Inference Economics AWS Inferentia2 Hugging Face Transformers Hugging Face +3 more

4·Hugging Face Blog·1mo ago

Accelerating SD Turbo and SDXL Turbo Inference with ONNX Runtime and Olive

This Hugging Face blog post details how to accelerate Stable Diffusion Turbo and SDXL Turbo inference using ONNX Runtime and Microsoft's Olive optimization toolkit. The post covers the workflow for converting and optimizing diffusion models for faster deployment. This is a practical inference optimization guide targeting practitioners deploying image generation models.

Inference Economics Agent and Tool Ecosystem Stable Diffusion Turbo SDXL Turbo Microsoft +3 more