§1
PTQ basics: int8, int4, scales, zero points
Quantization replaces fp16/bf16 weights with lower-bit integers (8, 5, 4, 3, 2, even 1.58 bits) plus a small amount of fp16 metadata per block. The math is a simple linear map, x ≈ scale · x_q, plus an optional zero-point offset for asymmetric schemes. The hard part is choosing the scale: per-tensor is fast but ruined by outliers; per-block is the practical default; per-channel is somewhere in between. This section derives the formulas, runs a real C kernel comparing int8 and int4 round-trip error, and surveys the seminal post-training methods (LLM.int8(), GPTQ, AWQ).
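As a rough illustration of the linear map described above, here is a minimal sketch of symmetric per-block quantization and its round-trip error. It is not the kernel from the chapter: the names `quantize_block` and `roundtrip_max_err`, the 32-weight block size, and the round-half-away-from-zero rule are all assumptions chosen for brevity.

```c
#include <assert.h>
#include <stdint.h>

#define BLOCK 32 /* assumed block size for this sketch */

/* Symmetric quantization of one block: x ~= scale * q, with q in [-qmax, qmax].
 * Returns the scale; writes the integer codes to q. */
static float quantize_block(const float *x, int n, int qmax, int8_t *q) {
    float amax = 0.0f;
    for (int i = 0; i < n; i++) {
        float a = x[i] < 0.0f ? -x[i] : x[i];
        if (a > amax) amax = a;
    }
    /* The largest magnitude maps to qmax; an all-zero block gets scale 1. */
    float scale = amax > 0.0f ? amax / (float)qmax : 1.0f;
    for (int i = 0; i < n; i++) {
        float v = x[i] / scale;
        /* Round half away from zero, without needing libm. */
        q[i] = (int8_t)(v >= 0.0f ? v + 0.5f : v - 0.5f);
    }
    return scale;
}

/* Largest absolute error after quantize -> dequantize for one block. */
static float roundtrip_max_err(const float *x, int n, int qmax) {
    assert(n > 0 && n <= BLOCK);
    int8_t q[BLOCK];
    float scale = quantize_block(x, n, qmax, q);
    float worst = 0.0f;
    for (int i = 0; i < n; i++) {
        float e = x[i] - scale * (float)q[i];
        if (e < 0.0f) e = -e;
        if (e > worst) worst = e;
    }
    return worst;
}
```

Calling `roundtrip_max_err` with `qmax = 127` gives the int8 case and `qmax = 7` the int4 case. The error is bounded by roughly half the scale, so a single outlier that inflates the block maximum raises the error for every other weight in the block, which is why smaller blocks tend to help.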
§2
The GGML quantization family: q4_0 to q6_K, and the q4_K_M naming
The exact byte-level layouts of every quantization format llama.cpp supports. Legacy q4_0/q4_1/q5_0/q5_1/q8_0 with 32-weight blocks. K-quants (q2_K..q6_K) with 256-weight super-blocks split hierarchically into 16- or 32-weight sub-blocks, each with its own quantized scale. The q4_K_M / q4_K_S model naming: which tensors get bumped to higher precision. IQ-quants and the imatrix calibration file. The format in which every Llama-3.3 70B GGUF on Hugging Face ships.
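To make "byte-level layout" concrete, the following is a hedged sketch of a q4_0-style block: one 16-bit scale followed by 32 packed 4-bit codes. The struct name `block4_sketch` and the helper names are hypothetical, the scale is kept as raw fp16 bits rather than decoded, and the nibble ordering (low nibbles for the first 16 weights, high nibbles for the last 16) is an assumption about the legacy format that should be checked against the ggml source.

```c
#include <assert.h>
#include <stdint.h>

#define QK 32 /* weights per block, as in the legacy 32-weight formats */

/* Sketch of a q4_0-style block: a 16-bit scale plus 32 four-bit codes. */
typedef struct {
    uint16_t d;          /* fp16 scale, stored as raw bits in this sketch */
    uint8_t  qs[QK / 2]; /* two 4-bit codes per byte */
} block4_sketch;

/* Unpack the 4-bit codes into signed integers in [-8, 7].
 * Assumed ordering: low nibbles hold weights 0..15, high nibbles 16..31. */
static void unpack_block4(const block4_sketch *b, int8_t out[QK]) {
    assert(b && out);
    for (int j = 0; j < QK / 2; j++) {
        out[j]          = (int8_t)((b->qs[j] & 0x0F) - 8);
        out[j + QK / 2] = (int8_t)((b->qs[j] >> 4) - 8);
    }
}

/* Effective storage cost, metadata included. */
static double bits_per_weight(void) {
    return 8.0 * (double)sizeof(block4_sketch) / (double)QK;
}
```

The block occupies 18 bytes for 32 weights, so a nominally 4-bit format costs 4.5 bits per weight once the scale is counted. Dequantizing a weight is then the decoded scale multiplied by the unpacked code.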
§3
Quantization-aware training: STE, BitNet, QLoRA
Quantization-aware training (QAT) puts the quantizer inside the training loop. The forward pass uses fake-quantized weights; the backward pass uses the straight-through estimator (STE) to pretend the quantizer is the identity. The result is a model that expects to be quantized and degrades less when deployed. BitNet pushes this to 1 to 1.58 bits, trained from scratch. QLoRA combines frozen 4-bit base weights with 16-bit LoRA adapters so that 65B to 70B models can be fine-tuned on a single 48 GB GPU.
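The forward/backward asymmetry of the straight-through estimator can be sketched on a single scalar weight. This is a toy under stated assumptions, not a training recipe: `fake_quant` and `qat_step` are hypothetical names, the loss is an assumed squared error against a target, and the plain identity STE is used (some variants also zero the gradient outside the clamp range).

```c
#include <assert.h>

/* Forward pass: snap w to a grid of step `scale`, clamped to [-qmax, qmax],
 * and return the dequantized ("fake-quantized") value. */
static float fake_quant(float w, float scale, int qmax) {
    assert(scale > 0.0f && qmax > 0);
    float v = w / scale;
    if (v > (float)qmax)  v = (float)qmax;
    if (v < -(float)qmax) v = -(float)qmax;
    int q = (int)(v >= 0.0f ? v + 0.5f : v - 0.5f); /* round half away from zero */
    return scale * (float)q;
}

/* One SGD step on the full-precision weight w for the assumed loss
 * L = 0.5 * (w_q - target)^2, where w_q = fake_quant(w). */
static float qat_step(float w, float target, float scale, int qmax, float lr) {
    float wq = fake_quant(w, scale, qmax); /* forward uses the quantized weight */
    float grad_wq = wq - target;           /* dL/dw_q */
    float grad_w = grad_wq;                /* STE: treat d(w_q)/dw as 1 */
    return w - lr * grad_w;                /* update the full-precision copy */
}
```

The true derivative of rounding is zero almost everywhere, so without the STE substitution the full-precision weight would never move. With it, the weight drifts until it crosses a rounding boundary and the quantized value changes.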