6 · Hugging Face Blog · 1mo ago

TGI Multi-LoRA: Deploy Once, Serve 30 Models

Hugging Face's Text Generation Inference (TGI) introduces Multi-LoRA serving, enabling a single base model deployment to serve up to 30 fine-tuned LoRA adapters simultaneously. This approach reduces infrastructure costs by eliminating the need to deploy separate model instances per fine-tune. The feature targets enterprise use cases where multiple task-specific variants of a base model are needed in production.
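As a rough illustration of what serving many adapters from one deployment can look like on the client side, the sketch below builds a request for TGI's `/generate` route and selects an adapter per request. It assumes that the adapter is chosen through an `adapter_id` field inside `parameters`, as described in the Multi-LoRA announcement; check this against the TGI version you run. The server URL and adapter names are hypothetical placeholders.

```python
import json
from urllib import request


def build_generate_payload(prompt, adapter_id=None, max_new_tokens=40):
    """Build the JSON body for a TGI /generate call.

    Assumption: the LoRA adapter is selected per request via
    parameters["adapter_id"]; omitting it falls back to the base model.
    """
    params = {"max_new_tokens": max_new_tokens}
    if adapter_id is not None:
        params["adapter_id"] = adapter_id
    return {"inputs": prompt, "parameters": params}


def generate(base_url, payload, timeout=30.0):
    """POST the payload to a running TGI server and return the generated text."""
    req = request.Request(
        f"{base_url}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))["generated_text"]


if __name__ == "__main__":
    # Hypothetical adapter names; the same base deployment serves each one.
    for adapter in ("my-org/support-lora", "my-org/summarize-lora", None):
        payload = build_generate_payload("Hello, who are you?", adapter_id=adapter)
        print(json.dumps(payload))
        # Uncomment with a real server URL to send the request:
        # print(generate("http://127.0.0.1:3000", payload))
```

Because the adapter is named in each request rather than fixed at deployment time, routing between fine-tunes becomes a matter of changing one field instead of calling a different endpoint.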

Related events (8)

6 · Hugging Face Blog · 1mo ago

Introducing multi-backends (TRT-LLM, vLLM) support for Text Generation Inference

Hugging Face's Text Generation Inference (TGI) now supports multiple inference backends, including NVIDIA TensorRT-LLM and vLLM, in addition to its native backend. This allows users to select the most appropriate backend for their hardware and workload without leaving the TGI ecosystem. The announcement positions TGI as a unified serving layer that abstracts over competing inference runtimes, potentially simplifying enterprise deployment workflows.

5 · Hugging Face Blog · 1mo ago

Goodbye cold boot - how we made LoRA Inference 300% faster

Hugging Face describes an optimization to their inference infrastructure that achieves a 300% speedup for LoRA adapter inference by enabling dynamic loading of adapters without cold boot penalties. The approach allows multiple LoRA adapters to be served efficiently from a single base model, reducing latency for adapter-based deployments. This is relevant to the growing ecosystem of fine-tuned model serving at scale.

4 · Hugging Face Blog · 1mo ago

Deploy LLMs with Hugging Face Inference Endpoints

Hugging Face published a guide on deploying large language models using their Inference Endpoints service. The post covers how to set up scalable, production-ready LLM deployments with minimal infrastructure overhead. It targets developers looking to move from experimentation to hosted inference without managing raw compute.

4 · Hugging Face Blog · 1mo ago

LoRA Training Scripts of the World, Unite!

Hugging Face published a blog post consolidating and comparing advanced LoRA fine-tuning scripts for Stable Diffusion XL, covering techniques such as pivotal tuning, custom captions, and various regularization strategies. The post aims to unify fragmented community training approaches into a more coherent set of best practices. It serves as a practical guide for practitioners fine-tuning SDXL models with LoRA adapters.

4 · Hugging Face Blog · 1mo ago

Accelerating LLM Inference with TGI on Intel Gaudi

Hugging Face's Text Generation Inference (TGI) framework has added a backend for Intel Gaudi accelerators, enabling LLM inference on Intel's AI hardware. The integration allows users to deploy large language models on Gaudi hardware using TGI's serving infrastructure. This expands the hardware ecosystem for LLM inference beyond NVIDIA GPUs, offering an alternative accelerator option for enterprise deployments.

6 · Hugging Face Blog · 1mo ago

Fine-tuning 20B LLMs with RLHF on a 24GB consumer GPU

Hugging Face demonstrates a method for running RLHF fine-tuning on 20-billion-parameter language models using a single 24GB consumer GPU by combining TRL and PEFT (parameter-efficient fine-tuning). The approach uses techniques like LoRA and quantization to dramatically reduce memory requirements. This lowers the hardware barrier for RLHF experimentation from multi-GPU server setups to consumer-grade hardware.

5 · Hugging Face Blog · 1mo ago

Introducing the Hugging Face LLM Inference Container for Amazon SageMaker

Hugging Face and Amazon Web Services have launched a dedicated LLM inference container for Amazon SageMaker, enabling optimized deployment of large language models on managed cloud infrastructure. The container is built on Hugging Face's Text Generation Inference (TGI) toolkit, which supports features like continuous batching, tensor parallelism, and quantization. This integration lowers the barrier for enterprise teams to deploy open-weight LLMs at scale on AWS without managing custom serving infrastructure.

4 · Hugging Face Blog · 1mo ago

Optimizing your LLM in production

This Hugging Face blog post covers practical techniques for optimizing large language models in production environments. It likely addresses inference efficiency methods such as quantization, batching, caching, and hardware utilization strategies. It serves as a practitioner-oriented guide for deploying LLMs at scale.