Almanac
5 · Hugging Face Blog · 1mo ago

Binary and Scalar Embedding Quantization for Significantly Faster & Cheaper Retrieval

This Hugging Face blog post covers techniques for quantizing text embeddings to binary and scalar (int8) representations, enabling much faster similarity search and a smaller memory footprint. The post details how binary quantization stores one bit per dimension, a 32x memory reduction relative to float32 embeddings, and searches with Hamming distance, while scalar quantization stores one byte per dimension (a 4x reduction) and offers a middle ground between speed and accuracy. Practical implementation guidance is provided using Sentence Transformers and the FAISS and USearch libraries, with benchmark results showing the trade-offs between retrieval speed and accuracy.
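The two quantization schemes can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the Sentence Transformers, FAISS, or USearch implementation: the function names are hypothetical, the binary code thresholds each dimension at zero, and the scalar code uses a single calibration range (`lo`, `hi`) for all dimensions. For scale, a 1024-dimensional float32 embedding occupies 4096 bytes, its int8 form 1024 bytes, and its binary form 128 bytes.

```python
from typing import List


def binary_quantize(embedding: List[float]) -> int:
    """Pack sign bits into an int: bit i is set when embedding[i] > 0."""
    bits = 0
    for i, value in enumerate(embedding):
        if value > 0:
            bits |= 1 << i
    return bits


def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions where two packed codes differ."""
    return bin(a ^ b).count("1")


def scalar_quantize(embedding: List[float], lo: float, hi: float) -> List[int]:
    """Map floats in [lo, hi] onto the int8 range [-128, 127].

    lo and hi are an assumed calibration range; values outside it are clamped.
    """
    codes = []
    for value in embedding:
        level = round((value - lo) * 255 / (hi - lo)) - 128
        codes.append(max(-128, min(127, level)))
    return codes


doc = binary_quantize([0.3, -1.2, 0.8, -0.1])
query = binary_quantize([0.5, 0.9, 0.7, -0.4])
distance = hamming_distance(doc, query)  # smaller means more similar
```

Because the distance is an XOR followed by a bit count, a binary index can compare many candidates cheaply; a common follow-up is to rescore the top candidates with higher-precision embeddings.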

Related events (8)

5 · Hugging Face Blog · 1mo ago

Overview of Natively Supported Quantization Schemes in 🤗 Transformers

This Hugging Face blog post surveys the quantization methods natively integrated into the Transformers library as of September 2023, covering schemes such as GPTQ, bitsandbytes (LLM.int8, NF4), and related techniques. It explains how each method works, its trade-offs in memory reduction and inference speed, and how practitioners can apply it via the Transformers API. The post serves as a practical reference for deploying large language models under memory constraints.

5 · Hugging Face Blog · 1mo ago

Train 400x Faster Static Embedding Models with Sentence Transformers

Hugging Face's Sentence Transformers library introduces support for training static embedding models that run up to 400x faster on CPU than transformer-based alternatives. Static embeddings use fixed token-level representations that are averaged or pooled without attention layers, dramatically reducing compute requirements. The post covers training methodology, trade-offs in embedding quality versus speed, and practical use cases where inference latency and training cost matter more than peak accuracy.
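To make "fixed token-level representations averaged without attention layers" concrete, here is a minimal sketch. It assumes a toy lookup table and whitespace tokenization, both hypothetical; the actual models use a trained embedding table and a real tokenizer.

```python
from typing import Dict, List


def embed(text: str, table: Dict[str, List[float]], dim: int) -> List[float]:
    """Mean-pool the fixed vectors of the known tokens in `text`.

    No attention or other contextual layer is involved: each token
    always maps to the same vector, regardless of its neighbours.
    """
    tokens = [tok for tok in text.lower().split() if tok in table]
    if not tokens:
        return [0.0] * dim  # no known tokens: fall back to a zero vector
    sums = [0.0] * dim
    for tok in tokens:
        for i, value in enumerate(table[tok]):
            sums[i] += value
    return [s / len(tokens) for s in sums]


# Toy 2-dimensional table, for illustration only.
toy_table = {"fast": [1.0, 0.0], "search": [0.0, 1.0]}
vector = embed("Fast search", toy_table, dim=2)
```

The cost is one table lookup and one addition per token, which is why this family of models is so much cheaper than running a transformer encoder.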

6 · Hugging Face Blog · 1mo ago

A Gentle Introduction to 8-bit Matrix Multiplication for Transformers at Scale using Hugging Face and bitsandbytes

This Hugging Face blog post introduces 8-bit quantization for large transformer models via integration of the bitsandbytes library with the transformers and accelerate libraries. It explains how LLM.int8() enables loading large models in 8-bit precision, significantly reducing GPU memory requirements without major accuracy degradation. The post covers the technical mechanics of mixed-precision decomposition and how practitioners can use the integration in practice.
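The mixed-precision decomposition can be sketched for a single input vector. This is a simplified illustration, not the bitsandbytes kernels: input dimensions whose magnitude reaches a threshold are treated as outliers and multiplied in full precision, while the remaining dimensions go through absmax int8 quantization. The function names are hypothetical, and the default threshold of 6.0 should be treated as an assumption.

```python
from typing import List, Tuple


def absmax_quantize(values: List[float]) -> Tuple[List[int], float]:
    """Scale values so the largest magnitude maps to 127, then round."""
    scale = max(abs(v) for v in values) / 127 or 1.0  # avoid a zero scale
    return [round(v / scale) for v in values], scale


def int8_matvec(x: List[float], W: List[List[float]], threshold: float = 6.0) -> List[float]:
    """Compute x @ W, where W has len(x) rows, with outlier decomposition."""
    n, m = len(W), len(W[0])
    outliers = [i for i in range(n) if abs(x[i]) >= threshold]
    regular = [i for i in range(n) if abs(x[i]) < threshold]
    out = [0.0] * m
    # Outlier dimensions: kept in floating point.
    for i in outliers:
        for j in range(m):
            out[j] += x[i] * W[i][j]
    # Regular dimensions: integer dot product, then rescale.
    if regular:
        xq, sx = absmax_quantize([x[i] for i in regular])
        for j in range(m):
            wq, sw = absmax_quantize([W[i][j] for i in regular])
            acc = sum(a * b for a, b in zip(xq, wq))
            out[j] += acc * sx * sw
    return out


x = [0.5, -1.0, 8.0]  # the last entry is treated as an outlier
W = [[1.0, 0.0], [0.5, -1.0], [2.0, 1.0]]
y = int8_matvec(x, W)  # the exact result would be [16.0, 9.0]
```

Keeping the few large-magnitude dimensions out of the int8 path is what prevents them from stretching the quantization scale and crushing the precision of everything else.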

5 · Hugging Face Blog · 1mo ago

Unlocking Longer Generation with Key-Value Cache Quantization

This Hugging Face blog post covers KV cache quantization as a technique to reduce memory consumption during LLM inference, enabling longer context generation without proportional VRAM increases. The post likely explains how quantizing the key-value cache (e.g., to INT8 or lower precision) trades minimal accuracy for significant memory savings. This is directly relevant to inference efficiency and long-context deployment patterns.
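A back-of-the-envelope calculation shows why cache precision matters for long contexts. The helper below is a hypothetical sketch: it counts one key and one value vector per token, per layer, per KV head, and ignores the small overhead of quantization scales and zero points. The model shape in the example is an assumed configuration, not one taken from the post.

```python
def kv_cache_bytes(
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    seq_len: int,
    bits_per_value: int,
    batch_size: int = 1,
) -> int:
    """Approximate KV cache size in bytes for a decoder-only transformer."""
    # Factor of 2: both keys and values are cached.
    n_values = 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size
    return n_values * bits_per_value // 8


# Assumed shape: 32 layers, 32 KV heads, head dimension 128, 4096 tokens.
fp16_bytes = kv_cache_bytes(32, 32, 128, 4096, bits_per_value=16)
int4_bytes = kv_cache_bytes(32, 32, 128, 4096, bits_per_value=4)
```

Under these assumptions the cache takes 2 GiB in fp16 and 0.5 GiB at 4 bits, and both figures grow linearly with sequence length and batch size.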

4 · Hugging Face Blog · 1mo ago

CPU Optimized Embeddings with Optimum Intel and fastRAG

Hugging Face and Intel demonstrate CPU-optimized embedding inference using Optimum Intel and fastRAG, targeting RAG pipeline acceleration without GPU hardware. The post covers quantization and optimization techniques that improve embedding throughput on Intel CPUs. This is relevant to inference economics and enterprise deployment patterns where GPU availability is constrained.

6 · Hugging Face Blog · 1mo ago

Making LLMs even more accessible with bitsandbytes, 4-bit quantization and QLoRA

Hugging Face published a blog post detailing the integration of 4-bit quantization via bitsandbytes into the Transformers library, enabling large language models to run on consumer-grade hardware. The post covers NF4 (NormalFloat4) data type and double quantization techniques from the QLoRA paper, which together reduce memory footprint significantly while preserving model quality. It demonstrates how users can load models like LLaMA in 4-bit precision and fine-tune them using QLoRA with minimal code changes.
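The idea behind a 4-bit data type such as NF4 is block-wise quantization to a 16-entry codebook: each block of weights is divided by its absolute maximum, and every weight is stored as the index of the nearest codebook level. The sketch below illustrates that mechanism only. Its codebook is a uniform placeholder, not the actual NF4 levels (which are spaced according to a normal distribution), the function names are hypothetical, and double quantization (compressing the per-block scales themselves) is omitted.

```python
from typing import List, Tuple

# Placeholder codebook: 16 evenly spaced levels in [-1, 1]. NOT the real NF4 values.
CODEBOOK = [-1.0 + 2.0 * i / 15 for i in range(16)]


def quantize_blockwise(weights: List[float], block_size: int = 64) -> Tuple[List[int], List[float]]:
    """Return 4-bit codebook indices and one absmax scale per block."""
    indices, scales = [], []
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        scale = max(abs(w) for w in block) or 1.0  # avoid dividing by zero
        scales.append(scale)
        for w in block:
            normalized = w / scale
            nearest = min(range(16), key=lambda k: abs(CODEBOOK[k] - normalized))
            indices.append(nearest)
    return indices, scales


def dequantize_blockwise(indices: List[int], scales: List[float], block_size: int = 64) -> List[float]:
    """Look up each index and multiply by the scale of its block."""
    return [CODEBOOK[idx] * scales[pos // block_size] for pos, idx in enumerate(indices)]


weights = [0.5, -1.0, 0.25, 0.0]
idx, scales = quantize_blockwise(weights, block_size=4)
restored = dequantize_blockwise(idx, scales, block_size=4)
```

Each weight costs 4 bits plus its share of the per-block scale, which is the overhead that double quantization is designed to shrink.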

5 · Hugging Face Blog · 1mo ago

Fine-tuning LLMs to 1.58bit: extreme quantization made easy

Hugging Face published a blog post describing a method for fine-tuning large language models down to 1.58-bit precision, referencing the BitNet b1.58 quantization scheme. The post covers tooling and workflows that make extreme quantization more accessible via the Hugging Face ecosystem. This represents a practical guide to applying ternary-weight quantization ({-1, 0, 1}) to existing models through fine-tuning rather than training from scratch.
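Ternary-weight quantization can be sketched with absmean scaling, which is how the BitNet b1.58 scheme is usually described: divide the weights by their mean absolute value, round, and clip to {-1, 0, 1}. The function name and `eps` value are hypothetical, and the sketch covers only the weight mapping, not the fine-tuning procedure from the post. The "1.58" comes from log2(3) ≈ 1.58 bits of information per three-valued weight.

```python
import math
from typing import List, Tuple


def ternary_quantize(weights: List[float], eps: float = 1e-8) -> Tuple[List[int], float]:
    """Map weights to {-1, 0, 1} using absmean scaling."""
    scale = sum(abs(w) for w in weights) / len(weights)
    quantized = [max(-1, min(1, round(w / (scale + eps)))) for w in weights]
    return quantized, scale


q, scale = ternary_quantize([0.9, -0.05, -1.2, 0.4])
bits_per_weight = math.log2(3)  # information content of a three-valued weight
```

Small weights collapse to 0 and large ones saturate at ±1, so fine-tuning is what lets the model adapt to such a coarse grid.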

5 · Hugging Face Blog · 1mo ago

Memory-efficient Diffusion Transformers with Quanto and Diffusers

This Hugging Face blog post describes integrating the Quanto quantization library with the Diffusers framework to reduce memory requirements for diffusion transformer models. The approach enables running large image/video generation models on consumer-grade hardware by applying int8 and int4 quantization to model weights. The post covers practical implementation details and benchmarks showing memory savings for models like Flux and others in the diffusion transformer family.