Almanac
← Events
4Hugging Face Blog·1mo ago

Benchmarking Text Generation Inference

Hugging Face published a benchmarking guide for Text Generation Inference (TGI), their production inference server. The post covers methodology for measuring throughput and latency under various load conditions, helping practitioners evaluate TGI performance for deployment decisions. It provides tooling and guidance for running reproducible benchmarks against TGI endpoints.

Related guides (3)

Related events (8)

5Hugging Face Blog·1mo ago·source ↗

Hugging Face Text Generation Inference available for AWS Inferentia2

Hugging Face has announced that its Text Generation Inference (TGI) serving framework is now available for AWS Inferentia2 accelerators. This integration allows users to deploy large language models on AWS's custom AI chips using the TGI stack. The move extends TGI's hardware support beyond GPUs to specialized inference silicon, potentially offering cost and performance advantages for production LLM deployments.

4Hugging Face Blog·1mo ago·source ↗

Text-Generation Pipeline on Intel® Gaudi® 2 AI Accelerator

Hugging Face published a blog post detailing how to run text-generation pipelines on Intel's Gaudi 2 AI accelerator. The post covers integration between Hugging Face's text-generation tooling and Intel's Gaudi 2 hardware, positioning it as an alternative inference accelerator to NVIDIA GPUs. This is relevant to the growing ecosystem of non-NVIDIA AI inference hardware.

6Hugging Face Blog·1mo ago·source ↗

Introducing multi-backends (TRT-LLM, vLLM) support for Text Generation Inference

Hugging Face's Text Generation Inference (TGI) now supports multiple inference backends, including NVIDIA TensorRT-LLM and vLLM, in addition to its native backend. This allows users to select the most appropriate backend for their hardware and workload without leaving the TGI ecosystem. The announcement positions TGI as a unified serving layer that abstracts over competing inference runtimes, potentially simplifying enterprise deployment workflows.

4Hugging Face Blog·1mo ago·source ↗

Accelerating LLM Inference with TGI on Intel Gaudi

Hugging Face's Text Generation Inference (TGI) framework has added a backend for Intel Gaudi accelerators, enabling LLM inference on Intel's AI hardware. The integration allows users to deploy large language models on Gaudi hardware using TGI's serving infrastructure. This expands the hardware ecosystem for LLM inference beyond NVIDIA GPUs, offering an alternative accelerator option for enterprise deployments.

5Hugging Face Blog·1mo ago·source ↗

TTS Arena: Benchmarking Text-to-Speech Models in the Wild

Hugging Face introduces TTS Arena, a community-driven evaluation platform for text-to-speech models modeled after the LLM Chatbot Arena approach. Users listen to audio samples from competing TTS systems and vote on quality, generating Elo-based rankings. The platform aims to provide a more ecologically valid benchmark than existing automated metrics, which often fail to capture human perceptual preferences. Initial results surface rankings across open and proprietary TTS models.

3Hugging Face Blog·1mo ago·source ↗

An Overview of Inference Solutions on Hugging Face

Hugging Face published a blog post surveying its inference product offerings as of late 2022. The post covers the range of hosted and API-based inference solutions available on the platform, aimed at helping developers choose appropriate deployment paths. This serves as a reference overview of Hugging Face's inference infrastructure ecosystem at that time.

4Hugging Face Blog·1mo ago·source ↗

Deploy MusicGen in no time with Inference Endpoints

Hugging Face published a guide on deploying Meta's MusicGen model as a production API using Hugging Face Inference Endpoints. The post covers custom inference handler setup, containerization, and API integration patterns for audio generation workloads. It demonstrates a practical deployment path for generative audio models outside of research environments.

5Hugging Face Blog·1mo ago·source ↗

Assisted Generation: a new direction toward low-latency text generation

Hugging Face introduces assisted generation (speculative decoding) as a practical technique for reducing LLM inference latency. The approach uses a smaller draft model to propose token candidates that a larger model then verifies in parallel, enabling multiple tokens to be accepted per forward pass. The blog post explains the mechanism and demonstrates integration into the Hugging Face Transformers library.