5arXiv cs.LG (Machine Learning)·23h ago

OrbitQuant: Data-agnostic quantization for image and video diffusion transformers

OrbitQuant is a new post-training quantization (PTQ) method for diffusion transformers (DiTs) that needs no calibration data: it quantizes activations in a normalized, rotated basis produced by a randomized permuted block-Hadamard rotation. A single Lloyd-Max codebook covers all timesteps, prompts, and layers for a given input dimension, and the same recipe transfers from image to video models without per-modality tuning. The authors evaluate the method on FLUX.1, Z-Image-Turbo, Wan 2.1, and CogVideoX and claim state-of-the-art PTQ results at several low-bit settings, including W2A4 for image DiTs.
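To make the idea in the summary concrete, here is a minimal NumPy sketch of data-agnostic activation quantization in a normalized, rotated basis. It is an illustration under stated assumptions, not the authors' implementation. The random sign flips, the block size, fitting the codebook on synthetic standard-normal samples, and storing one full-precision norm per vector are all assumptions, and every function name is hypothetical.

```python
import numpy as np

def hadamard(n):
    """Orthonormal n x n Hadamard matrix (Sylvester construction, n a power of two)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def make_rotation(dim, block, rng):
    """Random permutation + random signs (assumed) + one shared block-Hadamard."""
    assert dim % block == 0
    return rng.permutation(dim), rng.choice([-1.0, 1.0], size=dim), hadamard(block)

def rotate(x, perm, signs, H):
    block = H.shape[0]
    y = (x * signs)[..., perm]                        # flip signs, then shuffle channels
    y = y.reshape(*y.shape[:-1], -1, block) @ H.T     # mix within each block
    return y.reshape(x.shape)

def unrotate(y, perm, signs, H):
    block = H.shape[0]
    z = (y.reshape(*y.shape[:-1], -1, block) @ H).reshape(y.shape)
    out = np.empty_like(z)
    out[..., perm] = z                                # undo the permutation
    return out * signs                                # undo the sign flips

def lloyd_max(samples, bits, iters=30):
    """1-D Lloyd iterations: alternate nearest-level assignment and centroid update."""
    n = 2 ** bits
    levels = np.quantile(samples, (np.arange(n) + 0.5) / n)
    for _ in range(iters):
        edges = (levels[:-1] + levels[1:]) / 2
        idx = np.searchsorted(edges, samples)
        levels = np.array([samples[idx == k].mean() if np.any(idx == k) else levels[k]
                           for k in range(n)])
    return levels

def quantize(x, rot, levels):
    dim = x.shape[-1]
    norm = np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), 1e-12)
    y = rotate(x / norm, *rot) * np.sqrt(dim)         # coordinates roughly unit variance
    edges = (levels[:-1] + levels[1:]) / 2
    return np.searchsorted(edges, y), norm            # integer codes + per-vector norm

def dequantize(codes, norm, rot, levels):
    dim = codes.shape[-1]
    return unrotate(levels[codes] / np.sqrt(dim), *rot) * norm

# The codebook is fitted once on synthetic Gaussian samples, with no calibration data.
rng = np.random.default_rng(0)
codebook = lloyd_max(rng.standard_normal(100_000), bits=4)
rot = make_rotation(dim=256, block=64, rng=rng)
x = rng.standard_normal((8, 256))
codes, norm = quantize(x, rot, codebook)
x_hat = dequantize(codes, norm, rot, codebook)
rel_err = np.linalg.norm(x - x_hat) / np.linalg.norm(x)
```

The design point this sketch tries to capture is that, after normalization and rotation, coordinates look approximately Gaussian regardless of the input, so one codebook per input dimension can plausibly be reused across layers and timesteps. How OrbitQuant actually constructs its rotation and codebook may differ in the details.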

Inference Economics Multimodal Progress Lloyd-Max quantization OrbitQuant Wan 2.1 FLUX CogVideoX

Related guides (2)

Multimodal Progress · Topic guide

Multimodal Progress: How AI Learned to See, Hear, and Act

Inference Economics · Topic guide

Inference Economics: The Hidden Cost Battle Shaping AI

Related events (8)

5Hugging Face Blog·May 19, 2026·source ↗

Memory-efficient Diffusion Transformers with Quanto and Diffusers

This Hugging Face blog post describes integrating the Quanto quantization library with the Diffusers framework to reduce memory requirements for diffusion transformer models. The approach enables running large image/video generation models on consumer-grade hardware by applying int8 and int4 quantization to model weights. The post covers practical implementation details and benchmarks showing memory savings for models like Flux and others in the diffusion transformer family.

Inference Economics Agent and Tool Ecosystem Quanto Linear Diffusion Transformer Hugging Face +3 more

6arXiv · cs.AI·May 26, 2026·source ↗

OrpQuant: Geometric Orthogonal Residual Projection for Multiplier-Free Power-of-Two Transformer Quantization

This paper introduces Orthogonal Residual Projection (ORP), an algorithm-hardware co-design framework for ultra-low-bit quantization of LLMs and Vision Transformers targeting edge deployment. ORP addresses the structural limitations of Power-of-Two (PoT) quantization by formulating quantization as a dual-basis geometric projection that synthesizes higher-resolution residual lattices using only shift-and-add operations, eliminating multipliers. At 3-bit (W3/A16), ORP achieves 6.10 perplexity on LLaMA-2-7B, competitive with MAC-intensive baselines like AWQ, while reducing full-model calibration time to ~15 minutes. RTL synthesis at 28nm confirms hardware efficiency by mitigating timing bottlenecks from dense multiplier trees.

Training Infrastructure Evaluation and Benchmarking ViT (Vision Transformer) Orthogonal Residual Projection AWQ +5 more

5Hugging Face Blog·May 19, 2026·source ↗

Exploring Quantization Backends in Diffusers

Hugging Face published a technical overview of quantization backends available in the Diffusers library for image and video generation models. The post covers integration with multiple quantization frameworks (likely bitsandbytes, GGUF, torchao, and similar) and their trade-offs for diffusion model inference. It targets practitioners seeking to reduce memory footprint and improve throughput when deploying diffusion models.

Inference Economics Agent and Tool Ecosystem torchao GGUF Hugging Face +2 more

7arXiv · cs.LG·May 28, 2026·source ↗

Ω-QVLA: Training-Free W4A4 Quantization for Full Vision-Language-Action Models Including Diffusion Action Heads

Omega-QVLA is a post-training quantization framework that compresses both the LLM backbone and the diffusion-based action head of VLA models to uniform W4A4 precision without mixed-precision schemes or fine-tuning. It combines composite SVD-Hadamard rotation for weight energy equalization with per-step DiT activation scaling to handle dynamic-range drift across denoising steps. On the LIBERO benchmark, it achieves 98.0% and 87.8% task success on Pi 0.5 and GR00T N1.5 respectively—matching or exceeding FP16 baselines—while reducing static memory footprint by 71.3%. Real-world manipulation experiments confirm the approach generalizes beyond simulation.

Inference Economics Agent and Tool Ecosystem Pi 0.5 SVD-Hadamard rotation LIBERO +6 more

6arXiv · cs.AI·May 26, 2026·source ↗

Channel-wise Vector Quantization (CVQ): A New Image Tokenization Paradigm with Next-Channel Prediction

Researchers introduce Channel-wise Vector Quantization (CVQ), which replaces conventional patch-wise discrete tokens with channel-wise tokens that represent an image as discrete levels of visual detail. Built on CVQ, the Channel-wise Autoregressive (CAR) model uses a 'next-channel prediction' objective, generating images by progressively refining from global structure to fine-grained attributes. CVQ achieves 100% codebook utilization with a 16K+ codebook and the CAR model scores 86.7 on DPG and 0.79 on GenEval for text-to-image generation. The approach offers a structural alternative to raster-order patch-based autoregressive image generation.

Frontier Model Releases Evaluation and Benchmarking Channel-wise Vector Quantization DPG Benchmark GenEval +4 more

4arXiv · cs.CL·41h ago·source ↗

LogbQuant: Logarithmic quantization with adjustable bases for language models

A new arXiv preprint introduces LogbQuant, a logarithmic quantization scheme with tunable bases designed to better capture common weight distributions in language models. The method targets the known weakness of uniform quantization in handling low-frequency, high-magnitude weights. At 4-bit precision, LogbQuant claims superior performance over asymmetric linear quantization at tensor-wise granularity, with moderate speedup and high memory savings suitable for consumer-grade GPU deployment.

Open Weights Progress Inference Economics LogbQuant

5Hugging Face Blog·May 19, 2026·source ↗

Overview of Natively Supported Quantization Schemes in 🤗 Transformers

This Hugging Face blog post surveys the quantization methods natively integrated into the Transformers library as of September 2023, covering schemes such as GPTQ, bitsandbytes (LLM.int8, NF4), and related techniques. It explains how each method works, their trade-offs in terms of memory reduction and inference speed, and how practitioners can apply them via the Transformers API. The post serves as a practical reference for deploying large language models under memory constraints.

Open Weights Progress Inference Economics NF4 Hugging Face Transformers Hugging Face +4 more

4Hugging Face Blog·May 19, 2026·source ↗

VQ-Diffusion: Vector Quantized Diffusion Models on Hugging Face

This Hugging Face blog post introduces VQ-Diffusion, a text-to-image generation approach that combines vector quantization with diffusion models. The method operates in a discrete latent space defined by a VQ-VAE codebook, applying the diffusion process to token sequences rather than continuous pixel or latent representations. The post likely covers integration into the Hugging Face diffusers ecosystem and demonstrates generation capabilities.

Agent and Tool Ecosystem Multimodal Progress VQ-VAE Hugging Face VQ-Diffusion +1 more
