Entity · technique

INT4 Quantization

technique · active · int4-quantization-21667f24 · 3 events · first seen May 19, 2026

Aliases: INT4 Quantization, INT8 Quantization

Co-occurring entities

Intel Xeon · Hugging Face · Intel · MobileLLM-Pro · OLMoE-1B-7B · Mixture of Experts · MobileMoE · on-device MoE scaling law · quantization-aware training · Q8-Chat · speculative decoding · StarCoder2 · Optimum-Intel

More like this (12)

INT4 quantisation · 1.58-bit quantization · quantization · binary quantization · W4A4 quantization · scalar quantization · Power-of-Two (PoT) Quantization · Vector Quantization · KV Cache Quantization · quantization-induced degradation · Channel-wise Vector Quantization · NF4 (NormalFloat4)

Recent events (3)

7 · arXiv · cs.CL · May 27, 2026

MobileMoE: Scaling Mixture-of-Experts for Sub-Billion Parameter On-Device Deployment

MobileMoE introduces a family of on-device MoE language models with 0.3–0.9B active parameters and 1.3–5.3B total parameters, targeting mobile deployment under memory and compute constraints. The authors derive an on-device MoE scaling law identifying a sweet spot of moderate sparsity with fine-grained and shared experts, then train models through a four-stage recipe including quantization-aware training on open-source data. Across 14 benchmarks, MobileMoE matches or exceeds leading dense on-device LLMs with 2–4× fewer inference FLOPs, and delivers 1.8–3.8× faster prefill and 2.2–3.4× faster decode than dense baselines on commodity smartphones at comparable INT4 memory.

Training Infrastructure · Frontier Model Releases · MobileLLM-Pro · OLMoE-1B-7B · INT4 Quantization · +7 more
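
For a rough sense of what "comparable INT4 memory" means in the summary above, the sketch below estimates weight storage from a parameter count and a bit width. It counts weights only and ignores quantization scales, activations, and the KV cache, so the results are back-of-envelope estimates, not figures from the paper. The function name `weight_memory_gb` is hypothetical, and the FP16 comparison is an added assumption for context.

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes).

    Assumption: every parameter is stored at the same bit width and
    per-group scales / zero-points are not counted.
    """
    return num_params * bits_per_weight / 8 / 1e9


# Total parameter range quoted in the summary: 1.3B to 5.3B.
for total_params in (1.3e9, 5.3e9):
    int4 = weight_memory_gb(total_params, 4)
    fp16 = weight_memory_gb(total_params, 16)
    print(f"{total_params / 1e9:.1f}B params: INT4 ~{int4:.2f} GB, FP16 ~{fp16:.2f} GB")
```

Because an MoE model must keep all experts resident even though only a fraction is active per token, the total parameter count (not the active count) is what drives this estimate.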

5 · Hugging Face Blog · May 19, 2026

Q8-Chat: Efficient Generative AI on Intel Xeon via INT8 Quantization

Hugging Face and Intel demonstrate running quantized large language models (INT8/Q8) on Intel Xeon CPUs, branded as Q8-Chat. The post covers inference performance of quantized models on CPU hardware without requiring GPUs. This is relevant to inference economics and enterprise deployment, particularly for organizations without GPU infrastructure.

Inference Economics · Enterprise Deployment Patterns · Q8-Chat · Intel Xeon · INT4 Quantization · +2 more

4 · Hugging Face Blog · May 19, 2026

Accelerate StarCoder with Optimum Intel on Xeon: Q8/Q4 and Speculative Decoding

Hugging Face and Intel demonstrate quantization (INT8/INT4) and speculative decoding techniques applied to StarCoder on Intel Xeon CPUs using the Optimum Intel library. The post covers practical inference acceleration workflows targeting CPU deployment of code generation models. This represents a concrete inference-economics use case for open-weight code models on commodity server hardware.

Open Weights Progress · Inference Economics · speculative decoding · Intel Xeon · INT4 Quantization · +4 more
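
As a minimal illustration of the technique this page tracks, the sketch below shows generic symmetric round-to-nearest INT4 quantization with one scale per tensor, plus packing two 4-bit values into each byte. It is not the method used by Optimum Intel or by any of the posts above; production schemes typically use per-group scales, calibration data, or quantization-aware training. All function names here are hypothetical.

```python
def quantize_int4(weights):
    """Symmetric round-to-nearest quantization to the signed range [-8, 7].

    Assumption: a single scale for the whole list, chosen so that the
    largest magnitude maps to 7.
    """
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale


def dequantize_int4(q, scale):
    """Map integer codes back to approximate float values."""
    return [v * scale for v in q]


def pack_int4(q):
    """Pack pairs of signed 4-bit codes into bytes, low nibble first."""
    if len(q) % 2:
        q = q + [0]  # pad to an even count without mutating the input
    return bytes((q[i] & 0xF) | ((q[i + 1] & 0xF) << 4) for i in range(0, len(q), 2))


def unpack_int4(packed, count):
    """Inverse of pack_int4: recover `count` signed 4-bit codes."""
    out = []
    for byte in packed:
        for nibble in (byte & 0xF, byte >> 4):
            out.append(nibble - 16 if nibble >= 8 else nibble)
    return out[:count]


weights = [0.7, -0.35, 0.1, -0.7, 0.02]
codes, scale = quantize_int4(weights)
packed = pack_int4(codes)
restored = dequantize_int4(unpack_int4(packed, len(codes)), scale)
print(len(packed), "bytes for", len(weights), "weights")
print("max abs error:", max(abs(a - b) for a, b in zip(weights, restored)))
```

With this scheme the reconstruction error per weight is bounded by half the scale, which is why a single outlier in a tensor can hurt accuracy and why finer-grained (per-group or per-channel) scales are common in practice.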