Almanac
← Events
6Hugging Face Blog·1mo ago

Making LLMs lighter with AutoGPTQ and transformers

Hugging Face announces native integration of AutoGPTQ into the transformers library, enabling 4-bit quantized inference for large language models. The integration allows users to load and run GPTQ-quantized models directly through the standard transformers API with minimal code changes. This lowers the hardware barrier for deploying LLMs by significantly reducing VRAM requirements while maintaining competitive performance.

Related guides (4)

Related events (8)

6Hugging Face Blog·1mo ago·source ↗

Making LLMs even more accessible with bitsandbytes, 4-bit quantization and QLoRA

Hugging Face published a blog post detailing the integration of 4-bit quantization via bitsandbytes into the Transformers library, enabling large language models to run on consumer-grade hardware. The post covers NF4 (NormalFloat4) data type and double quantization techniques from the QLoRA paper, which together reduce memory footprint significantly while preserving model quality. It demonstrates how users can load models like LLaMA in 4-bit precision and fine-tune them using QLoRA with minimal code changes.

5Hugging Face Blog·1mo ago·source ↗

Overview of Natively Supported Quantization Schemes in 🤗 Transformers

This Hugging Face blog post surveys the quantization methods natively integrated into the Transformers library as of September 2023, covering schemes such as GPTQ, bitsandbytes (LLM.int8, NF4), and related techniques. It explains how each method works, their trade-offs in terms of memory reduction and inference speed, and how practitioners can apply them via the Transformers API. The post serves as a practical reference for deploying large language models under memory constraints.

6Hugging Face Blog·1mo ago·source ↗

A Gentle Introduction to 8-bit Matrix Multiplication for Transformers at Scale using Hugging Face and bitsandbytes

This Hugging Face blog post introduces 8-bit quantization for large transformer models via integration of the bitsandbytes library with the transformers and accelerate libraries. It explains how LLM.int8() enables loading large models in 8-bit precision, significantly reducing GPU memory requirements without major accuracy degradation. The post covers the technical mechanics of mixed-precision decomposition and how practitioners can use the integration in practice.

5Hugging Face Blog·1mo ago·source ↗

Transformers Backend Integration in SGLang

Hugging Face has announced an integration that allows SGLang, a high-performance LLM serving framework, to use the Transformers library as a backend. This enables models supported by Transformers to be served through SGLang's inference engine, combining SGLang's optimized serving capabilities with the broad model coverage of the Transformers ecosystem. The integration lowers the barrier for deploying a wide range of models with production-grade inference infrastructure.

4Hugging Face Blog·1mo ago·source ↗

Accelerating LLM Inference with TGI on Intel Gaudi

Hugging Face's Text Generation Inference (TGI) framework has added a backend for Intel Gaudi accelerators, enabling LLM inference on Intel's AI hardware. The integration allows users to deploy large language models on Gaudi hardware using TGI's serving infrastructure. This expands the hardware ecosystem for LLM inference beyond NVIDIA GPUs, offering an alternative accelerator option for enterprise deployments.

5Hugging Face Blog·1mo ago·source ↗

Introducing AutoRound: Intel's Advanced Quantization for LLMs and VLMs

Intel has released AutoRound, an advanced quantization technique for large language models and vision-language models, announced via the Hugging Face blog. AutoRound targets efficient low-bit quantization to reduce model size and inference costs while preserving accuracy. The tool is positioned as a production-ready quantization solution integrated with the Hugging Face ecosystem.

4Hugging Face Blog·1mo ago·source ↗

Optimizing your LLM in production

A Hugging Face blog post covering practical techniques for optimizing large language models in production environments. The post likely addresses inference efficiency methods such as quantization, batching, caching, and hardware utilization strategies. It serves as a practitioner-oriented guide for deploying LLMs at scale.

4Hugging Face Blog·1mo ago·source ↗

Getting Started with Transformers on Habana Gaudi

This Hugging Face blog post introduces integration between the Transformers library and Habana Gaudi AI accelerators. It provides a practical guide for running transformer model training and inference on Gaudi hardware as an alternative to GPU-based infrastructure. The post signals growing ecosystem support for non-NVIDIA AI accelerator hardware.