Almanac
4·Hugging Face Blog·1mo ago

Accelerating LLM Inference with TGI on Intel Gaudi

Hugging Face's Text Generation Inference (TGI) framework has added a backend for Intel Gaudi accelerators, so users can deploy large language models on Gaudi hardware through TGI's existing serving infrastructure. This broadens the hardware options for LLM inference beyond NVIDIA GPUs and gives enterprise deployments an alternative accelerator.

Related events (8)

6·Hugging Face Blog·1mo ago

Introducing multi-backends (TRT-LLM, vLLM) support for Text Generation Inference

Hugging Face's Text Generation Inference (TGI) now supports multiple inference backends, including NVIDIA TensorRT-LLM and vLLM, in addition to its native backend. This allows users to select the most appropriate backend for their hardware and workload without leaving the TGI ecosystem. The announcement positions TGI as a unified serving layer that abstracts over competing inference runtimes, potentially simplifying enterprise deployment workflows.

4·Hugging Face Blog·1mo ago

Text-Generation Pipeline on Intel® Gaudi® 2 AI Accelerator

Hugging Face published a blog post detailing how to run text-generation pipelines on Intel's Gaudi 2 AI accelerator. The post covers integration between Hugging Face's text-generation tooling and Intel's Gaudi 2 hardware, positioning it as an alternative inference accelerator to NVIDIA GPUs. This is relevant to the growing ecosystem of non-NVIDIA AI inference hardware.

4·Hugging Face Blog·1mo ago

Faster Assisted Generation Support for Intel Gaudi

Hugging Face has published a blog post on assisted generation (speculative decoding) optimized for Intel Gaudi accelerators. The post covers implementation details and the performance improvements achieved on Gaudi hardware. It is an inference optimization relevant to deployments on non-NVIDIA AI accelerators.

5·Hugging Face Blog·1mo ago

Hugging Face Text Generation Inference available for AWS Inferentia2

Hugging Face has announced that its Text Generation Inference (TGI) serving framework is now available for AWS Inferentia2 accelerators. This integration allows users to deploy large language models on AWS's custom AI chips using the TGI stack. The move extends TGI's hardware support beyond GPUs to specialized inference silicon, potentially offering cost and performance advantages for production LLM deployments.

4·Hugging Face Blog·1mo ago

Fast Inference on Large Language Models: BLOOMZ on Habana Gaudi2 Accelerator

This Hugging Face blog post covers deploying BLOOMZ, a large multilingual language model, on Intel's Habana Gaudi2 accelerator for inference. It benchmarks throughput and latency on Gaudi2 as an alternative to GPU-based inference. The post is part of ongoing work to demonstrate non-NVIDIA hardware options for large model deployment.

3·Hugging Face Blog·1mo ago

Accelerating Vision-Language Models: BridgeTower on Habana Gaudi2

This Hugging Face blog post covers the deployment and acceleration of BridgeTower, a vision-language model, on Intel's Habana Gaudi2 AI accelerator hardware. The piece likely benchmarks inference throughput and training performance on Gaudi2 compared to other hardware. It represents a practical infrastructure and deployment case study for multimodal models on alternative AI accelerators.

5·Hugging Face Blog·1mo ago

Introducing the Hugging Face LLM Inference Container for Amazon SageMaker

Hugging Face and Amazon Web Services have launched a dedicated LLM inference container for Amazon SageMaker, enabling optimized deployment of large language models on managed cloud infrastructure. The container is built on Hugging Face's Text Generation Inference (TGI) toolkit, which supports features like continuous batching, tensor parallelism, and quantization. This integration lowers the barrier for enterprise teams to deploy open-weight LLMs at scale on AWS without managing custom serving infrastructure.

3·Hugging Face Blog·1mo ago

Accelerating Protein Language Model ProtST on Intel Gaudi 2

A Hugging Face blog post details the acceleration of ProtST, a protein language model, on Intel's Gaudi 2 AI accelerator hardware. The post covers the technical integration and performance results of running this specialized biological ML model on Gaudi 2. This represents an intersection of domain-specific AI (protein modeling) and alternative AI hardware ecosystems.