INT4 Quantization
int4-quantization-21667f24·3 events·first seen 28d agoAliases: INT4 Quantization, INT8 Quantization
Co-occurring entities
More like this (12)
Recent events (3)
Q8-Chat: Efficient Generative AI on Intel Xeon via INT8 Quantization
Hugging Face and Intel demonstrate running quantized large language models (INT8/Q8) on Intel Xeon CPUs, branded as Q8-Chat. The post covers inference performance of quantized models on CPU hardware without requiring GPUs. This is relevant to inference economics and enterprise deployment, particularly for organizations without GPU infrastructure.
Accelerate StarCoder with Optimum Intel on Xeon: Q8/Q4 and Speculative Decoding
Hugging Face and Intel demonstrate quantization (INT8/INT4) and speculative decoding techniques applied to StarCoder on Intel Xeon CPUs using the Optimum Intel library. The post covers practical inference acceleration workflows targeting CPU deployment of code generation models. This represents a concrete inference-economics use case for open-weight code models on commodity server hardware.
MobileMoE: Scaling Mixture-of-Experts for Sub-Billion Parameter On-Device Deployment
MobileMoE introduces a family of on-device MoE language models with 0.3–0.9B active parameters and 1.3–5.3B total parameters, targeting mobile deployment under memory and compute constraints. The authors derive an on-device MoE scaling law identifying a sweet spot of moderate sparsity with fine-grained and shared experts, then train models through a four-stage recipe including quantization-aware training on open-source data. Across 14 benchmarks, MobileMoE matches or exceeds leading dense on-device LLMs with 2–4× fewer inference FLOPs, and delivers 1.8–3.8× faster prefill and 2.2–3.4× faster decode than dense baselines on commodity smartphones at comparable INT4 memory.