Almanac
product

Text Generation Inference

productactivetext-generation-inference-b6762d33·8 events·first seen 28d ago

Aliases: Text Generation Inference

Co-occurring entities

More like this (12)

Recent events (8)

6Hugging Face Blog·28d ago·source ↗

Introducing multi-backends (TRT-LLM, vLLM) support for Text Generation Inference

Hugging Face's Text Generation Inference (TGI) now supports multiple inference backends, including NVIDIA TensorRT-LLM and vLLM, in addition to its native backend. This allows users to select the most appropriate backend for their hardware and workload without leaving the TGI ecosystem. The announcement positions TGI as a unified serving layer that abstracts over competing inference runtimes, potentially simplifying enterprise deployment workflows.

6Hugging Face Blog·28d ago·source ↗

TGI Multi-LoRA: Deploy Once, Serve 30 Models

Hugging Face's Text Generation Inference (TGI) introduces Multi-LoRA serving, enabling a single base model deployment to serve up to 30 fine-tuned LoRA adapters simultaneously. This approach reduces infrastructure costs by eliminating the need to deploy separate model instances per fine-tune. The feature targets enterprise use cases where multiple task-specific variants of a base model are needed in production.

4Hugging Face Blog·28d ago·source ↗

Benchmarking Text Generation Inference

Hugging Face published a benchmarking guide for Text Generation Inference (TGI), their production inference server. The post covers methodology for measuring throughput and latency under various load conditions, helping practitioners evaluate TGI performance for deployment decisions. It provides tooling and guidance for running reproducible benchmarks against TGI endpoints.

4Hugging Face Blog·28d ago·source ↗

Accelerating LLM Inference with TGI on Intel Gaudi

Hugging Face's Text Generation Inference (TGI) framework has added a backend for Intel Gaudi accelerators, enabling LLM inference on Intel's AI hardware. The integration allows users to deploy large language models on Gaudi hardware using TGI's serving infrastructure. This expands the hardware ecosystem for LLM inference beyond NVIDIA GPUs, offering an alternative accelerator option for enterprise deployments.

5Hugging Face Blog·28d ago·source ↗

From OpenAI to Open LLMs with Messages API on Hugging Face

Hugging Face's Text Generation Inference (TGI) now supports an OpenAI-compatible Messages API, enabling developers to switch from OpenAI models to open-weight LLMs with minimal code changes. The integration allows existing OpenAI SDK users to point their client at Hugging Face endpoints by changing only the base URL and model name. This lowers the migration barrier for teams wanting to self-host or use open models while retaining familiar tooling.

5Hugging Face Blog·28d ago·source ↗

Hugging Face Text Generation Inference available for AWS Inferentia2

Hugging Face has announced that its Text Generation Inference (TGI) serving framework is now available for AWS Inferentia2 accelerators. This integration allows users to deploy large language models on AWS's custom AI chips using the TGI stack. The move extends TGI's hardware support beyond GPUs to specialized inference silicon, potentially offering cost and performance advantages for production LLM deployments.

5Hugging Face Blog·28d ago·source ↗

Introducing the Hugging Face LLM Inference Container for Amazon SageMaker

Hugging Face and Amazon Web Services have launched a dedicated LLM inference container for Amazon SageMaker, enabling optimized deployment of large language models on managed cloud infrastructure. The container is built on Hugging Face's Text Generation Inference (TGI) toolkit, which supports features like continuous batching, tensor parallelism, and quantization. This integration lowers the barrier for enterprise teams to deploy open-weight LLMs at scale on AWS without managing custom serving infrastructure.

5Hugging Face Blog·28d ago·source ↗

Goodbye cold boot - how we made LoRA Inference 300% faster

Hugging Face describes an optimization to their inference infrastructure that achieves a 300% speedup for LoRA adapter inference by enabling dynamic loading of adapters without cold boot penalties. The approach allows multiple LoRA adapters to be served efficiently from a single base model, reducing latency for adapter-based deployments. This is relevant to the growing ecosystem of fine-tuned model serving at scale.