5 · Hugging Face Blog · 1mo ago

Hugging Face Text Generation Inference available for AWS Inferentia2

Hugging Face has announced that its Text Generation Inference (TGI) serving framework is now available for AWS Inferentia2 accelerators. This integration allows users to deploy large language models on AWS's custom AI chips using the TGI stack. The move extends TGI's hardware support beyond GPUs to specialized inference silicon, potentially offering cost and performance advantages for production LLM deployments.
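As a rough illustration of what "using the TGI stack" means for a client, the sketch below builds (but does not send) an HTTP request for TGI's `/generate` route, with the prompt under `inputs` and generation options under `parameters`. The base URL is a hypothetical placeholder, the helper name `build_generate_request` is invented for this example, and it is an assumption that an Inferentia2-backed deployment is reached through the same route as a GPU-backed one.

```python
import json
import urllib.request


def build_generate_request(base_url, prompt, max_new_tokens=64):
    """Build a POST request for a TGI-style /generate route.

    base_url is a placeholder for wherever the server is reachable;
    nothing is sent over the network here.
    """
    payload = {
        "inputs": prompt,
        "parameters": {"max_new_tokens": max_new_tokens},
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical local endpoint; replace with a real deployment address.
request = build_generate_request("http://localhost:8080", "What is Inferentia2?", 32)
```

Passing the returned object to `urllib.request.urlopen` would send it. If the assumption above holds, client code like this would not need to change when the serving hardware does.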

Related events (8)

4 · Hugging Face Blog · 1mo ago

Accelerating Hugging Face Transformers with AWS Inferentia2

Hugging Face published a blog post detailing how to accelerate Transformer model inference using AWS Inferentia2, Amazon's second-generation ML inference chip. The post covers integration patterns between the Hugging Face ecosystem and the Neuron SDK for deploying models on Inferentia2 hardware. It serves as a practical guide to enterprise and cloud inference deployment on dedicated AI accelerators.

5 · Hugging Face Blog · 1mo ago

Deploy models on AWS Inferentia2 from Hugging Face

Hugging Face has announced support for deploying models on AWS Inferentia2 via Hugging Face Inference Endpoints. The integration allows users to deploy popular open-weight models on AWS's custom ML accelerator chips directly from the Hugging Face Hub. This expands the hardware options available for cost-effective inference beyond standard GPU instances.

4 · Hugging Face Blog · 1mo ago

Accelerating LLM Inference with TGI on Intel Gaudi

Hugging Face's Text Generation Inference (TGI) framework has added a backend for Intel Gaudi accelerators, enabling LLM inference on Intel's AI hardware. The integration allows users to deploy large language models on Gaudi hardware using TGI's serving infrastructure. This expands the hardware ecosystem for LLM inference beyond NVIDIA GPUs, offering an alternative accelerator option for enterprise deployments.

5 · Hugging Face Blog · 1mo ago

Introducing the Hugging Face LLM Inference Container for Amazon SageMaker

Hugging Face and Amazon Web Services have launched a dedicated LLM inference container for Amazon SageMaker, enabling optimized deployment of large language models on managed cloud infrastructure. The container is built on Hugging Face's Text Generation Inference (TGI) toolkit, which supports features like continuous batching, tensor parallelism, and quantization. This integration lowers the barrier for enterprise teams to deploy open-weight LLMs at scale on AWS without managing custom serving infrastructure.

4 · Hugging Face Blog · 1mo ago

Text-Generation Pipeline on Intel® Gaudi® 2 AI Accelerator

Hugging Face published a blog post detailing how to run text-generation pipelines on Intel's Gaudi 2 AI accelerator. The post covers integration between Hugging Face's text-generation tooling and Intel's Gaudi 2 hardware, positioning it as an alternative inference accelerator to NVIDIA GPUs. This is relevant to the growing ecosystem of non-NVIDIA AI inference hardware.

3 · Hugging Face Blog · 1mo ago

Accelerate BERT inference with Hugging Face Transformers and AWS Inferentia

This Hugging Face blog post describes how to deploy BERT models on AWS Inferentia chips using the Hugging Face Transformers library and Amazon SageMaker. It covers the workflow for compiling models with the AWS Neuron SDK and running optimized inference on Inferentia hardware. The post targets practitioners looking to reduce inference costs and latency for transformer-based NLP workloads.

6 · Hugging Face Blog · 1mo ago

Introducing multi-backends (TRT-LLM, vLLM) support for Text Generation Inference

Hugging Face's Text Generation Inference (TGI) now supports multiple inference backends, including NVIDIA TensorRT-LLM and vLLM, in addition to its native backend. This allows users to select the most appropriate backend for their hardware and workload without leaving the TGI ecosystem. The announcement positions TGI as a unified serving layer that abstracts over competing inference runtimes, potentially simplifying enterprise deployment workflows.

4 · Hugging Face Blog · 1mo ago

Benchmarking Text Generation Inference

Hugging Face published a benchmarking guide for Text Generation Inference (TGI), its production inference server. The post covers methodology for measuring throughput and latency under various load conditions, helping practitioners evaluate TGI performance for deployment decisions. It provides tooling and guidance for running reproducible benchmarks against TGI endpoints.
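To make the two headline metrics concrete, here is a minimal sketch of how latency percentiles and token throughput could be computed from per-request timings. It is not the tooling from the post: the `RequestTiming` record, the nearest-rank percentile method, and the sample numbers are all assumptions chosen for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class RequestTiming:
    start_s: float          # time the request was sent (seconds)
    end_s: float            # time the last token arrived (seconds)
    generated_tokens: int   # tokens produced for this request


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list, with pct in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(timings):
    """Summarize end-to-end latency and aggregate token throughput."""
    latencies = [t.end_s - t.start_s for t in timings]
    # Wall-clock span of the whole run, so overlapping requests count once.
    wall_s = max(t.end_s for t in timings) - min(t.start_s for t in timings)
    total_tokens = sum(t.generated_tokens for t in timings)
    return {
        "p50_latency_s": percentile(latencies, 50),
        "p95_latency_s": percentile(latencies, 95),
        "throughput_tokens_per_s": total_tokens / wall_s,
    }


# Made-up timings for four partially overlapping requests.
sample = [
    RequestTiming(0.0, 1.0, 10),
    RequestTiming(0.0, 2.0, 20),
    RequestTiming(1.0, 4.0, 30),
    RequestTiming(2.0, 4.0, 20),
]
summary = summarize(sample)
```

Throughput is divided by the wall-clock span of the run rather than the sum of latencies, because concurrent requests overlap in time. Repeating the calculation at several concurrency levels is one way to see how latency and throughput trade off under load.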