LLM.int8

technique·active·llm-int8-5b13eb10·2 events·first seen 28d ago

Aliases: LLM.int8, LLM.int8()

Recent events (2)

6·Hugging Face Blog·28d ago

A Gentle Introduction to 8-bit Matrix Multiplication for Transformers at Scale using Hugging Face and bitsandbytes

This Hugging Face blog post introduces 8-bit quantization for large transformer models through the integration of the bitsandbytes library with the transformers and accelerate libraries. It explains how LLM.int8() loads large models in 8-bit precision, roughly halving the GPU memory needed for weights relative to 16-bit formats, without major accuracy degradation. The post covers the mechanics of mixed-precision decomposition, in which outlier feature dimensions are multiplied in 16-bit precision while the remaining values go through int8, and shows how practitioners can use the integration.
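The mixed-precision decomposition mentioned in this summary can be illustrated with a small pure-Python sketch. It is a toy under stated assumptions, not the bitsandbytes implementation: the function names (`find_outlier_columns`, `absmax_scale`, `quantize`, `mixed_precision_matmul`) and the example matrices are hypothetical, plain Python floats stand in for 16-bit values, and the outlier threshold of 6.0 is assumed as the default.

```python
def find_outlier_columns(X, threshold):
    """Indices of feature dimensions where any activation reaches the threshold."""
    return [j for j in range(len(X[0])) if any(abs(row[j]) >= threshold for row in X)]


def absmax_scale(values):
    """Scale factor that maps the largest magnitude in `values` to 127."""
    peak = max((abs(v) for v in values), default=0.0)
    return 127.0 / peak if peak > 0 else 1.0


def quantize(values, scale):
    """Round scaled values to integers in the int8 range [-127, 127]."""
    return [max(-127, min(127, round(v * scale))) for v in values]


def mixed_precision_matmul(X, W, threshold=6.0):
    """Approximate X @ W: outlier dimensions in full precision, the rest via int8."""
    outliers = find_outlier_columns(X, threshold)
    regular = [j for j in range(len(W)) if j not in outliers]
    n_cols = len(W[0])

    # Row-wise absmax quantization of X, column-wise of W (regular dimensions only).
    x_scales = [absmax_scale([row[j] for j in regular]) for row in X]
    Xq = [quantize([row[j] for j in regular], s) for row, s in zip(X, x_scales)]
    w_cols = [[W[j][c] for j in regular] for c in range(n_cols)]
    w_scales = [absmax_scale(col) for col in w_cols]
    Wq = [quantize(col, s) for col, s in zip(w_cols, w_scales)]

    result = []
    for i, row in enumerate(X):
        out_row = []
        for c in range(n_cols):
            # Integer accumulation, then dequantize with the product of both scales.
            int_acc = sum(a * b for a, b in zip(Xq[i], Wq[c]))
            low_precision = int_acc / (x_scales[i] * w_scales[c])
            # Outlier dimensions bypass quantization entirely.
            high_precision = sum(row[j] * W[j][c] for j in outliers)
            out_row.append(low_precision + high_precision)
        result.append(out_row)
    return result


# Hypothetical activations: feature dimension 2 carries large-magnitude outliers.
X = [[0.5, -1.2, 40.0, 0.3],
     [1.1, 0.7, -35.0, -0.9]]
W = [[0.2, -0.1],
     [0.4, 0.3],
     [0.05, -0.02],
     [-0.3, 0.6]]

exact = [[sum(X[i][j] * W[j][c] for j in range(4)) for c in range(2)] for i in range(2)]
approx = mixed_precision_matmul(X, W)
max_err = max(abs(a - e) for ra, re in zip(approx, exact) for a, e in zip(ra, re))
```

Keeping the outlier dimension out of the int8 path matters because a single large value would otherwise set the absmax scale for its whole row and squeeze the remaining values into a handful of integer levels.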

5·Hugging Face Blog·28d ago

Overview of Natively Supported Quantization Schemes in 🤗 Transformers

This Hugging Face blog post surveys the quantization methods natively integrated into the Transformers library as of September 2023, covering GPTQ and the bitsandbytes schemes (LLM.int8 and NF4), among related techniques. It explains how each method works, what it trades off between memory reduction and inference speed, and how practitioners can apply it through the Transformers API. The post serves as a practical reference for deploying large language models under memory constraints.
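To make the memory side of these trade-offs concrete, here is a rough back-of-the-envelope sketch. It counts weight storage only, and the names `BYTES_PER_PARAM` and `weight_memory_gib` as well as the 7-billion-parameter model size are hypothetical choices for illustration. It ignores activations, quantization constants, and any values kept in 16-bit precision, so real footprints will be somewhat higher.

```python
# Approximate storage cost per parameter for each format (weights only).
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "nf4": 0.5}


def weight_memory_gib(num_params, dtype):
    """Estimated weight memory in GiB for a model with `num_params` parameters."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


# Hypothetical 7-billion-parameter model.
estimates = {dtype: weight_memory_gib(7e9, dtype) for dtype in BYTES_PER_PARAM}
for dtype, gib in estimates.items():
    print(f"{dtype}: {gib:.1f} GiB")
```

Under these assumptions, each halving of the bytes per parameter halves the weight footprint, which is why 8-bit and 4-bit loading can fit a model on hardware where 16-bit weights would not.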