4 · arXiv cs.CL (Computation and Language) · 42h ago

LogbQuant: Logarithmic quantization with adjustable bases for language models

A new arXiv preprint introduces LogbQuant, a logarithmic quantization scheme whose base is tunable so that the quantization levels can better match common weight distributions in language models. The method targets a known weakness of uniform quantization: rare, high-magnitude weights stretch the quantization range and leave few levels for the many small weights. At 4-bit precision and tensor-wise granularity, the authors report better performance than asymmetric linear quantization, with a moderate speedup and memory savings large enough for deployment on consumer-grade GPUs.
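The contrast between the two schemes can be sketched in a few lines of Python. This is an illustrative toy, not the preprint's formulation, which this summary does not give: it assumes one sign bit plus a 3-bit exponent index `k`, levels of the form `sign * scale * base**(-k)`, and a tensor-wise `scale` equal to the largest absolute weight. The function names `log_quantize`, `asymmetric_linear_quantize`, and `mse` are hypothetical.

```python
import math
import random

def log_quantize(weights, base=2.0, bits=4):
    """Round each weight to sign * scale * base**(-k) (assumed level layout)."""
    levels = 2 ** (bits - 1)                 # one bit for the sign, the rest index k
    scale = max(abs(w) for w in weights) or 1.0
    out = []
    for w in weights:
        ratio = abs(w) / scale
        if ratio == 0.0:
            # This sketch has no exact zero: fall back to the smallest level.
            sign, k = 1.0, levels - 1
        else:
            sign = math.copysign(1.0, w)
            k = round(-math.log(ratio, base))    # nearest level in the log domain
            k = min(max(k, 0), levels - 1)
        out.append(sign * scale * base ** (-k))
    return out

def asymmetric_linear_quantize(weights, bits=4):
    """Tensor-wise asymmetric linear quantization with a zero point."""
    qmax = 2 ** bits - 1
    lo, hi = min(weights), max(weights)
    step = (hi - lo) / qmax or 1.0
    zero_point = round(-lo / step)
    out = []
    for w in weights:
        q = min(max(round(w / step) + zero_point, 0), qmax)
        out.append((q - zero_point) * step)
    return out

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

# Toy tensor: many small weights plus two large outliers (made-up data).
random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(1000)] + [0.9, -0.7]
for base in (2.0, 1.5):
    print("log base", base, mse(weights, log_quantize(weights, base=base)))
print("linear", mse(weights, asymmetric_linear_quantize(weights)))
```

Changing `base` moves the levels closer to or further from zero, which is the knob the title refers to. Which scheme gives the lower error in this toy depends on the weight distribution and on the chosen base.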

Open Weights Progress · Inference Economics · LogbQuant

Related guides (2)

Open Weights Progress · Topic guide

Open Weights Progress: How Free AI Models Caught Up to the Frontier

Inference Economics · Topic guide

Inference Economics: The Hidden Cost Battle Shaping AI

Related events (8)

5 · Hugging Face Blog · May 19, 2026

Fine-tuning LLMs to 1.58bit: extreme quantization made easy

Hugging Face published a blog post describing a method for fine-tuning large language models down to 1.58-bit precision, based on the BitNet b1.58 quantization scheme. The post covers tooling and workflows that make extreme quantization more accessible within the Hugging Face ecosystem. It is a practical guide to applying ternary-weight quantization ({-1, 0, 1}) to existing models through fine-tuning rather than training from scratch.
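As a rough illustration of what ternary weights mean, the sketch below applies absmean rounding, the weight quantizer generally associated with BitNet b1.58: divide by the mean absolute weight, round, and clip to {-1, 0, 1}. The post's own tooling and fine-tuning workflow are not shown here; the function name `ternary_quantize` and the sample weights are hypothetical.

```python
import math

def ternary_quantize(weights, eps=1e-8):
    """Absmean rounding: scale by mean |w|, round, clip to {-1, 0, 1}."""
    gamma = sum(abs(w) for w in weights) / len(weights)
    codes = [max(-1, min(1, round(w / (gamma + eps)))) for w in weights]
    return codes, gamma

def ternary_dequantize(codes, gamma):
    """Reconstruct approximate weights from ternary codes and the scale."""
    return [c * gamma for c in codes]

codes, gamma = ternary_quantize([0.9, -0.05, 0.4, -1.2])
print(codes)                       # one of three values per weight
print(ternary_dequantize(codes, gamma))
print(math.log2(3))                # three states carry about 1.58 bits each
```

The "1.58-bit" label comes from the information content of a three-valued symbol, log2(3), rather than from a literal storage format.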

Open Weights Progress · Inference Economics · Transformers · 1.58-bit quantization · Hugging Face

6 · Hugging Face Blog · May 19, 2026

Making LLMs even more accessible with bitsandbytes, 4-bit quantization and QLoRA

Hugging Face published a blog post detailing the integration of 4-bit quantization via bitsandbytes into the Transformers library, enabling large language models to run on consumer-grade hardware. The post covers the NF4 (NormalFloat4) data type and the double quantization technique from the QLoRA paper, which together reduce memory footprint significantly while preserving model quality. It demonstrates how users can load models such as LLaMA in 4-bit precision and fine-tune them using QLoRA with minimal code changes.

Open Weights Progress · Inference Economics · Transformers · NF4 (NormalFloat4) · QLoRA

5 · Hugging Face Blog · May 19, 2026

Introducing AutoRound: Intel's Advanced Quantization for LLMs and VLMs

Intel has released AutoRound, an advanced quantization technique for large language models and vision-language models, announced via the Hugging Face blog. AutoRound targets efficient low-bit quantization to reduce model size and inference costs while preserving accuracy. The tool is positioned as a production-ready quantization solution integrated with the Hugging Face ecosystem.

Open Weights Progress · Inference Economics · Hugging Face · AutoRound · Intel

7 · arXiv · cs.LG · May 28, 2026

Ω-QVLA: Training-Free W4A4 Quantization for Full Vision-Language-Action Models Including Diffusion Action Heads

Ω-QVLA is a post-training quantization framework that compresses both the LLM backbone and the diffusion-based action head of VLA models to uniform W4A4 precision, without mixed-precision schemes or fine-tuning. It combines a composite SVD-Hadamard rotation for weight energy equalization with per-step DiT activation scaling to handle dynamic-range drift across denoising steps. On the LIBERO benchmark, it achieves 98.0% and 87.8% task success on Pi 0.5 and GR00T N1.5 respectively, matching or exceeding the FP16 baselines, while reducing static memory footprint by 71.3%. Real-world manipulation experiments confirm that the approach generalizes beyond simulation.

Inference Economics · Agent and Tool Ecosystem · Pi 0.5 · SVD-Hadamard rotation · LIBERO

5 · Hugging Face Blog · May 19, 2026

Overview of Natively Supported Quantization Schemes in 🤗 Transformers

This Hugging Face blog post surveys the quantization methods natively integrated into the Transformers library as of September 2023, covering schemes such as GPTQ and bitsandbytes (LLM.int8, NF4), along with related techniques. It explains how each method works, its trade-offs in memory reduction and inference speed, and how practitioners can apply it via the Transformers API. The post serves as a practical reference for deploying large language models under memory constraints.

Open Weights Progress · Inference Economics · NF4 · Hugging Face Transformers · Hugging Face

5 · arXiv · cs.LG · 24h ago

OrbitQuant: Data-agnostic quantization for image and video diffusion transformers

OrbitQuant is a new post-training quantization method for diffusion transformers that avoids the need for calibration data by quantizing activations in a normalized, rotated basis using a randomized permuted block-Hadamard rotation. A single Lloyd-Max codebook covers all timesteps, prompts, and layers for a given input dimension, and the same recipe transfers from image to video models without per-modality tuning. The method is evaluated on FLUX.1, Z-Image-Turbo, Wan 2.1, and CogVideoX, claiming state-of-the-art PTQ results at several low-bit settings including W2A4 for image DiTs.
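For readers unfamiliar with Lloyd-Max codebooks, the sketch below fits one with the classic Lloyd iteration: assign each sample to its nearest level, then move each level to the mean of its samples. It is a generic 1-D illustration, not OrbitQuant's procedure. The standard-normal samples stand in for normalized, rotated activations as an assumption, and the names `lloyd_max` and `quantize` are hypothetical.

```python
import random

def lloyd_max(samples, n_levels, iters=50):
    """Fit a 1-D codebook that locally minimizes squared error on the samples."""
    xs = sorted(samples)
    # Start from evenly spaced sample quantiles.
    levels = [xs[int((i + 0.5) * len(xs) / n_levels)] for i in range(n_levels)]
    for _ in range(iters):
        buckets = [[] for _ in levels]
        for x in xs:
            # Assignment step: nearest level.
            j = min(range(n_levels), key=lambda i: abs(x - levels[i]))
            buckets[j].append(x)
        # Update step: each level moves to the centroid of its bucket.
        levels = [sum(b) / len(b) if b else lv for b, lv in zip(buckets, levels)]
    return sorted(levels)

def quantize(x, levels):
    """Map a value to the nearest codebook entry."""
    return min(levels, key=lambda lv: abs(x - lv))

# Assumed stand-in data: standard-normal samples, 2-bit (4-level) codebook.
random.seed(0)
samples = [random.gauss(0.0, 1.0) for _ in range(2000)]
codebook = lloyd_max(samples, n_levels=4, iters=30)
print(codebook)
print(quantize(0.3, codebook))
```

Because the codebook depends only on the distribution of the values being quantized, a single codebook can be reused wherever that distribution holds, which is consistent with the summary's claim of one codebook across timesteps, prompts, and layers.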

Inference Economics · Multimodal Progress · Lloyd-Max quantization · OrbitQuant · Wan 2.1

4 · Hugging Face Blog · May 19, 2026

Optimizing your LLM in production

A Hugging Face blog post covering practical techniques for optimizing large language models in production environments. The post likely addresses inference efficiency methods such as quantization, batching, caching, and hardware utilization strategies. It serves as a practitioner-oriented guide for deploying LLMs at scale.

Inference Economics · Enterprise Deployment Patterns · Hugging Face

6 · Hugging Face Blog · May 19, 2026

Falcon-Edge: 1.58-bit Quantized Language Model Series from TII

Technology Innovation Institute (TII) has released Falcon-Edge, a series of language models operating at 1.58-bit precision, targeting edge deployment scenarios. The models are designed to be fine-tunable despite extreme quantization, positioning them as practical options for resource-constrained environments. This release extends the Falcon model family into the ultra-low-bit regime, following broader industry interest in BitNet-style ternary weight models.

Frontier Model Releases · Open Weights Progress · BitNet · 1.58-bit quantization · Falcon-Edge
