Quantization in practice

PTQ basics (int8/int4, scale + zero point), the LLM.int8/GPTQ/AWQ family, the GGML/llama.cpp quantization family (q4_0..q6_K, q4_K_M vs q4_K_S, IQ-quants, imatrix), and quantization-aware training (STE, BitNet, QLoRA).

§1 PTQ basics — int8, int4, scales, zero points
§2 The GGML quantization family — q4_0 to q6_K, and the q4_K_M naming
§3 Quantization-aware training — STE, BitNet, QLoRA

§1 PTQ basics — int8, int4, scales, zero points
Quantization replaces fp16/bf16 weights with lower-bit integers (8, 5, 4, 3, 2, even 1.58 bits) plus a small amount of fp16 metadata per block. The math is a simple affine map, x ≈ scale · (x_q − zero_point), which reduces to x ≈ scale · x_q in the symmetric case where the zero point is 0. The hard part is choosing how many weights share one scale: per-tensor is cheap but ruined by outliers, because a single large weight stretches the scale for every other value; per-channel sits in between; per-block is the practical default. This section derives the formulas, runs a real C kernel comparing int8 and int4 round-trip error, and surveys the seminal post-training methods (LLM.int8, GPTQ, AWQ).
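The effect of scale granularity can be sketched in a few lines. This is a minimal Python illustration, not the chapter's C kernel: the helper names (quantize_symmetric, roundtrip_rmse), the Gaussian test weights, and the single injected outlier are all assumptions made for the example. It covers only the symmetric case, so no zero point appears.

```python
import random

def quantize_symmetric(values, bits):
    """Symmetric quantization with one shared scale: x ≈ scale * q."""
    qmax = 2 ** (bits - 1) - 1            # 127 for int8, 7 for int4
    amax = max(abs(v) for v in values)
    scale = amax / qmax if amax > 0 else 1.0
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return scale, q

def dequantize(scale, q):
    return [scale * qi for qi in q]

def roundtrip_rmse(values, bits, block_size=None):
    """RMS error after quantize -> dequantize.

    block_size=None means a single scale for the whole tensor.
    """
    block_size = block_size or len(values)
    sq_err = 0.0
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        scale, q = quantize_symmetric(block, bits)
        sq_err += sum((a - b) ** 2 for a, b in zip(block, dequantize(scale, q)))
    return (sq_err / len(values)) ** 0.5

random.seed(0)
# Hypothetical weight tensor: small Gaussian values plus one large outlier.
weights = [random.gauss(0.0, 0.02) for _ in range(4096)]
weights[100] = 1.5

for bits in (8, 4):
    print(f"int{bits} per-tensor RMSE: {roundtrip_rmse(weights, bits):.5f}")
    print(f"int{bits} per-block  RMSE: {roundtrip_rmse(weights, bits, block_size=32):.5f}")
```

With a per-tensor scale the outlier sets the step size for all 4096 weights; with 32-weight blocks it only degrades the one block it lives in, which is why the per-block error should come out lower at both bit widths.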
§2 The GGML quantization family — q4_0 to q6_K, and the q4_K_M naming
The exact byte-level layouts of every quantization format llama.cpp supports. Legacy q4_0/q4_1/q5_0/q5_1/q8_0 use 32-weight blocks. K-quants (q2_K..q6_K) use 256-weight super-blocks split hierarchically into 16- or 32-weight sub-blocks, depending on the type. The q4_K_M / q4_K_S model naming: which tensors get bumped to higher precision. IQ-quants and the imatrix calibration file. These are the formats behind the Llama-3.3 70B GGUF files on Hugging Face.
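To make the idea of a 32-weight block concrete, here is a small Python sketch modeled on the q4_0 layout: a 2-byte fp16 scale followed by 16 bytes that each hold two 4-bit codes. The function names are hypothetical, and the rounding rule and nibble order reflect my reading of the format rather than a byte-exact copy of llama.cpp, so check them against the source before relying on them.

```python
import struct

QK = 32  # weights per block in the legacy formats

def pack_q4_0_style(block):
    """Pack 32 floats into 18 bytes: a 2-byte fp16 scale plus 16 bytes of 4-bit codes."""
    assert len(block) == QK
    vmax = max(block, key=abs)          # signed value with the largest magnitude
    d = vmax / -8.0                     # scale chosen so vmax maps to code 0 (i.e. -8)
    inv = 1.0 / d if d != 0.0 else 0.0
    # Codes are unsigned 0..15 and stand for the signed range -8..7.
    codes = [min(15, int(x * inv + 8.5)) for x in block]
    # Assumed order: byte j holds weight j in the low nibble, weight j+16 in the high nibble.
    qs = bytes(codes[j] | (codes[j + 16] << 4) for j in range(QK // 2))
    return struct.pack("<e", d) + qs

def unpack_q4_0_style(buf):
    """Inverse of pack_q4_0_style: x ≈ d * (code - 8)."""
    (d,) = struct.unpack_from("<e", buf, 0)
    qs = buf[2:2 + QK // 2]
    low = [(b & 0x0F) - 8 for b in qs]
    high = [(b >> 4) - 8 for b in qs]
    return [d * c for c in low + high]

block = [(i - 16) / 16 for i in range(QK)]   # toy weights in [-1, 1)
packed = pack_q4_0_style(block)
restored = unpack_q4_0_style(packed)
bits_per_weight = 8 * len(packed) / QK
max_err = max(abs(a - b) for a, b in zip(block, restored))
print(len(packed), "bytes,", bits_per_weight, "bits per weight, max error", max_err)
```

The 18 bytes per 32 weights are where the 4.5 bits per weight of a nominally 4-bit format come from: the extra half bit is the per-block scale.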
§3 Quantization-aware training — STE, BitNet, QLoRA
Quantization-aware training (QAT) puts the quantizer inside the training loop. The forward pass uses fake-quantized weights; the backward pass uses the straight-through estimator (STE), which treats the non-differentiable rounding step as the identity so gradients can flow to the underlying full-precision weights. The result is a model that EXPECTS to be quantized and degrades less when deployed. BitNet pushes this to 1 to 1.58 bits per weight, trained from scratch. QLoRA combines frozen 4-bit base weights with 16-bit LoRA adapters so 70B models can be fine-tuned on a single 48GB GPU; the 4-bit weights of a 70B model alone take about 35GB, which does not fit in 24GB.
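The fake-quantize-forward, straight-through-backward loop can be sketched on a toy linear model. Everything here is an assumption chosen for illustration: the fixed step size SCALE, the int4 range, the synthetic data, and the plain gradient-descent loop. Real QAT setups typically learn or recompute the scale and often zero the gradient for clipped weights, which this sketch does not do.

```python
import random

SCALE = 0.1   # fixed quantization step (assumption for this toy example)
QMAX = 7      # symmetric int4 range [-7, 7]

def fake_quant(w):
    """Quantize then dequantize, so values stay float but sit on the int4 grid."""
    return [SCALE * max(-QMAX, min(QMAX, round(wi / SCALE))) for wi in w]

def loss_and_grad(w_q, xs, ys):
    """Mean squared error of a linear model and its gradient with respect to w_q."""
    n = len(xs)
    grad = [0.0] * len(w_q)
    loss = 0.0
    for x, y in zip(xs, ys):
        err = sum(wi * xi for wi, xi in zip(w_q, x)) - y
        loss += err * err / n
        for j, xj in enumerate(x):
            grad[j] += 2.0 * err * xj / n
    return loss, grad

random.seed(0)
w_true = [0.3, -0.5, 0.2, 0.6]   # target weights, chosen to lie on the int4 grid
xs = [[random.gauss(0, 1) for _ in w_true] for _ in range(256)]
ys = [sum(wi * xi for wi, xi in zip(w_true, x)) for x in xs]

w = [0.0] * len(w_true)          # latent full-precision weights
initial_loss, _ = loss_and_grad(fake_quant(w), xs, ys)
for step in range(300):
    w_q = fake_quant(w)                          # forward pass sees quantized weights
    loss, grad_wq = loss_and_grad(w_q, xs, ys)
    # STE: d(w_q)/d(w) is treated as 1, so the gradient for w_q updates w directly.
    w = [wi - 0.05 * g for wi, g in zip(w, grad_wq)]

final_loss, _ = loss_and_grad(fake_quant(w), xs, ys)
print(f"loss {initial_loss:.4f} -> {final_loss:.4f}, deployed weights {fake_quant(w)}")
```

The key line is the update: the loss is evaluated on w_q, but the step is applied to the latent w. Without the STE the gradient through round() would be zero almost everywhere and the latent weights would never move.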
