Entity · technique

GPTQ

technique · active · gptq-bc8cdf24 · 3 events · first seen May 19, 2026

Aliases: GPTQ

Co-occurring entities

Hugging Face, Sophgo, AWQ, Qwen, Llama, TPU-MLIR, MLIR, LLM-TPU, Transformers, AutoGPTQ, NF4, Hugging Face Transformers, bitsandbytes, LLM.int8

More like this (12)

AutoGPTQ, GPT, GPTs, WebGPT, GPT-next, GPT Builder, GPTs are GPTs, GPQA, GPT-4, GPT-f, GPT-OSS, GPT-1

Recent events (3)

4 · arXiv · cs.CL · Jul 20, 2026

MLIR-based compilation method for LLM inference on specialized hardware

Researchers present an MLIR-based compiler pipeline for deploying large language models on AI accelerators, using two dialect layers (TopOp for framework-agnostic graph representation and TpuOp for hardware-specific lowering). The method splits each Transformer layer into three static compilation stages (prefill, prefill_kv, decode) to handle the distinct computational profiles of prompt processing and autoregressive generation. The approach is implemented in the open-source TPU-MLIR compiler and LLM-TPU project, supporting Qwen, Llama, InternVL, and MiniCPM-V families with GPTQ, AWQ, and AutoRound quantization.

Training Infrastructure, Inference Economics, Sophgo, AWQ, Qwen, +5 more

6 · Hugging Face Blog · May 19, 2026

Making LLMs lighter with AutoGPTQ and transformers

Hugging Face announces native integration of AutoGPTQ into the transformers library, enabling 4-bit quantized inference for large language models. The integration allows users to load and run GPTQ-quantized models directly through the standard transformers API with minimal code changes. This lowers the hardware barrier for deploying LLMs by significantly reducing VRAM requirements while maintaining competitive performance.
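To make the VRAM reduction concrete, here is a back-of-the-envelope sketch of weight memory at 16-bit versus 4-bit precision. The 7-billion-parameter model size is a hypothetical chosen for illustration, not a figure from the post, and the estimate covers weights only: it ignores the per-group scales and zero points that quantized formats store, as well as activations and the KV cache.

```python
def weight_memory_gib(n_params: int, bits_per_weight: float) -> float:
    """Approximate memory needed for model weights alone, in GiB."""
    total_bytes = n_params * bits_per_weight / 8
    return total_bytes / 2**30


# Hypothetical 7-billion-parameter model, chosen only for illustration.
n_params = 7_000_000_000

fp16_gib = weight_memory_gib(n_params, 16)  # half-precision baseline
int4_gib = weight_memory_gib(n_params, 4)   # 4-bit quantized weights

print(f"16-bit weights: {fp16_gib:.1f} GiB")
print(f"4-bit weights:  {int4_gib:.1f} GiB")
print(f"reduction:      {fp16_gib / int4_gib:.0f}x")
```

Under these assumptions the weights shrink by a factor of four; real savings are somewhat smaller once quantization metadata and runtime buffers are counted.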

Open Weights Progress, Inference Economics, Transformers, Hugging Face, AutoGPTQ, +2 more

5 · Hugging Face Blog · May 19, 2026

Overview of Natively Supported Quantization Schemes in 🤗 Transformers

This Hugging Face blog post surveys the quantization methods natively integrated into the Transformers library as of September 2023, covering GPTQ and the bitsandbytes schemes (LLM.int8, NF4), among related techniques. It explains how each method works, the trade-offs each makes between memory reduction and inference speed, and how practitioners can apply them via the Transformers API. The post serves as a practical reference for deploying large language models under memory constraints.
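As a rough illustration of what low-bit weight quantization involves, the sketch below applies plain round-to-nearest quantization to one small group of weights and then decodes it. This is a deliberately simpler stand-in, not the GPTQ algorithm: GPTQ additionally uses approximate second-order information to adjust the remaining weights and compensate for the error each rounding step introduces. The function names and sample weights are hypothetical.

```python
from typing import List, Tuple


def quantize_group(weights: List[float], bits: int = 4) -> Tuple[List[int], float, int]:
    """Asymmetric round-to-nearest quantization of one group of weights.

    Returns the integer codes plus the scale and zero point needed to decode.
    """
    qmax = 2**bits - 1  # 15 for 4-bit codes
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / qmax if hi > lo else 1.0
    zero_point = round(-lo / scale)
    # Round each weight to the nearest code and clamp into [0, qmax].
    codes = [min(qmax, max(0, round(w / scale) + zero_point)) for w in weights]
    return codes, scale, zero_point


def dequantize_group(codes: List[int], scale: float, zero_point: int) -> List[float]:
    """Map integer codes back to approximate floating-point weights."""
    return [(c - zero_point) * scale for c in codes]


# Hypothetical weight group, for illustration only.
weights = [-0.42, -0.10, 0.00, 0.07, 0.19, 0.33, 0.51, 0.88]
codes, scale, zero_point = quantize_group(weights)
restored = dequantize_group(codes, scale, zero_point)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("codes:", codes)
print(f"scale={scale:.4f}, zero_point={zero_point}, max_error={max_error:.4f}")
```

Each group stores one scale and one zero point alongside its 4-bit codes, which is why smaller groups improve accuracy at the cost of extra memory.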

Open Weights Progress, Inference Economics, NF4, Hugging Face Transformers, Hugging Face, +4 more