PTQ basics — int8, int4, scales, zero points
A Llama 2 70B model in fp16 is 140 GB. At 4-bit weights it’s 35 GB — fits on a single H100. At 2-bit weights it’s 17.5 GB — fits on a consumer 24GB GPU with room to spare. The economics of running an LLM are dominated by this number: bits per weight. The technique that achieves it — replacing high-precision weights with low-bit integers plus a small per-block scale — is called quantization. This section covers the basic linear-map math (scale + optional zero point), the choice of per-tensor vs per-block vs per-channel granularity, the outlier problem that drove LLM.int8 → GPTQ → AWQ, and a kernel that measures the round-trip error of each scheme. The next two sections cover the GGML format family (q4_0 through q6_K, the most-deployed quantization scheme in the world) and quantization-aware training.
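The size figures above are plain arithmetic: parameters times bits per weight, divided by 8. A minimal sketch, assuming 1 GB = 1e9 bytes and ignoring activations, KV cache, and per-block scale overhead (the name `weight_gigabytes` is ours, for illustration only):

```c
#include <assert.h>

/* Weight storage in GB (1 GB = 1e9 bytes) for n_params weights stored at
   bits_per_weight bits each. Counts weights only. */
static double weight_gigabytes(double n_params, double bits_per_weight) {
    assert(n_params >= 0.0 && bits_per_weight > 0.0);
    return n_params * bits_per_weight / 8.0 / 1e9;
}
```

Calling `weight_gigabytes(70e9, 16.0)`, `weight_gigabytes(70e9, 4.0)`, and `weight_gigabytes(70e9, 2.0)` reproduces the 140 / 35 / 17.5 GB figures.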
The quantization linear map
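For a single weight value x ∈ ℝ, the standard quantization map takes one of two forms:
Symmetric: x_q = round(x / scale); x_dq = scale · x_q
Asymmetric: x_q = round((x - min) / scale); x_dq = scale · x_q + min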
The symmetric form has one parameter (scale) per block; the asymmetric form has two (scale and min). Both are linear maps from a real-valued range onto a discrete integer grid. The cost is b bits per weight for the integer plus the per-block fp16 parameters, amortised over the block: 16/K bits per weight for each fp16 parameter when a block holds K weights.
Scale is the load-bearing parameter. For symmetric int8 with weights in x ∈ [-W_max, W_max], the scale is typically scale = W_max / 127, so the largest weight maps to ±127 (the code -128 goes unused, which keeps the grid symmetric).
Zero point is the offset that allows the integer grid to be shifted: x ≈ scale · x_q + min. Asymmetric quantization is more accurate for non-symmetric distributions (e.g., post-ReLU activations always ≥ 0) at the cost of one more fp16 per block.
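The symmetric int8 case can be sketched in a few lines. This is an illustration under assumptions: the function names are ours, the rounding helper stands in for `roundf`, and codes are clamped to [-127, 127].

```c
#include <assert.h>
#include <stdint.h>

/* Round to the nearest integer, halves away from zero. */
static int round_nearest(float v) {
    return (int)(v >= 0.0f ? v + 0.5f : v - 0.5f);
}

/* Scale for weights known to lie in [-w_max, w_max]. */
static float sym_int8_scale(float w_max) {
    assert(w_max > 0.0f);
    return w_max / 127.0f;
}

/* Quantize: x_q = round(x / scale), clamped to [-127, 127]. */
static int8_t quantize_sym_int8(float x, float scale) {
    assert(scale > 0.0f);
    int q = round_nearest(x / scale);
    if (q > 127) q = 127;
    if (q < -127) q = -127;
    return (int8_t)q;
}

/* Dequantize: x_dq = scale * x_q. */
static float dequantize_sym_int8(int8_t q, float scale) {
    return scale * (float)q;
}
```

For any x inside [-w_max, w_max], the round-trip error is at most half a step, i.e. scale / 2.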
Per-tensor vs per-block vs per-channel
The granularity vs overhead trade-off is the central one in low-bit quantization. Coarser granularity (per-tensor) saves overhead but suffers from outliers; per-channel (one scale per row or column) sits in between; finer granularity (per-block) confines each outlier's damage to its own block but adds overhead. For int4 (4 bits per weight, the regime most LLM deployments care about), the overhead of per-block scales (0.5 bpw at K=32) is small enough that per-block dominates.
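The three granularities differ only in how many weights share one scale. A minimal sketch, assuming a row-major weight array and symmetric absmax scales (`group_scales` and its parameters are illustrative names, not from any library):

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Largest magnitude among n contiguous floats. */
static float absmax(const float *w, size_t n) {
    float m = 0.0f;
    for (size_t i = 0; i < n; i++) {
        float a = fabsf(w[i]);
        if (a > m) m = a;
    }
    return m;
}

/* Write one symmetric scale (absmax / qmax) per group of `group` contiguous
   weights and return how many scales were written.
     group == n           : per-tensor (a single scale)
     group == row length  : per-channel for a row-major matrix
     group == 32          : per-block */
static size_t group_scales(const float *w, size_t n, size_t group,
                           float qmax, float *scales) {
    assert(group > 0 && qmax > 0.0f);
    size_t count = 0;
    for (size_t start = 0; start < n; start += group) {
        size_t len = (n - start < group) ? n - start : group;
        scales[count++] = absmax(w + start, len) / qmax;
    }
    return count;
}
```

With one large value in the array, the per-tensor scale is inflated for every weight, the per-channel scale only for that row, and the per-block scale only for that block.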
Symmetric: x_q = round(x / scale); x_dq = scale · x_q. Signed integer; for int4 this means x_q ∈ [-8, 7].
Asymmetric: x_q = round((x - min) / scale); x_dq = scale · x_q + min. Unsigned integer; for int4 this means x_q ∈ [0, 15].
Scale: the slope of the linear map, i.e. how much real-value range each integer step covers. For symmetric int4 on a block with absmax W: scale = W / 7 (or W / 8 if the -8 code is used for the extreme). For asymmetric int4 on a block with smallest value min and largest value max: scale = (max - min) / 15. Set once per block.
Zero point (asymmetric only): the offset, i.e. the real value that integer 0 corresponds to. Storing min is one common parameterisation (the real value at integer 0 is exactly min). For weights centred around zero, min ≈ -max, so the asymmetric grid nearly coincides with the symmetric one and the extra parameter buys little while costing one more fp16 per block.
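A sketch of the asymmetric int4 map for one block, using the min parameterisation. The struct and function names are hypothetical, and the parameters are kept in `float` here where a real format would store fp16:

```c
#include <assert.h>
#include <stdint.h>

/* Round to the nearest integer, halves away from zero. */
static int round_nearest(float v) {
    return (int)(v >= 0.0f ? v + 0.5f : v - 0.5f);
}

typedef struct {
    float scale;  /* (max - min) / 15 */
    float min;    /* real value represented by code 0 */
} AsymParams;

/* Derive scale and min from the block's own value range. */
static AsymParams asym_int4_params(const float *x, int n) {
    assert(n > 0);
    float lo = x[0], hi = x[0];
    for (int i = 1; i < n; i++) {
        if (x[i] < lo) lo = x[i];
        if (x[i] > hi) hi = x[i];
    }
    AsymParams p;
    p.min = lo;
    p.scale = (hi - lo) / 15.0f;
    if (p.scale == 0.0f) p.scale = 1.0f;  /* constant block: any scale works */
    return p;
}

/* x_q = round((x - min) / scale), clamped to the unsigned range [0, 15]. */
static uint8_t asym_int4_quantize(float x, AsymParams p) {
    int q = round_nearest((x - p.min) / p.scale);
    if (q < 0) q = 0;
    if (q > 15) q = 15;
    return (uint8_t)q;
}

/* x_dq = scale * x_q + min. */
static float asym_int4_dequantize(uint8_t q, AsymParams p) {
    return p.scale * (float)q + p.min;
}
```

The block minimum always lands on code 0 and the block maximum on code 15, so no part of the integer range is spent on values the block does not contain.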
When asymmetric pays off: when the value distribution is skewed (e.g., post-ReLU activations are always ≥ 0; LayerNorm γ values cluster near 1; some weight columns have a strong sign bias). For these, asymmetric uses the full integer range; symmetric wastes half of it on the empty side of zero.
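One way to see the "wastes half" claim is to compare step sizes on the same data. A rough sketch, assuming symmetric int4 uses codes [-7, 7] and asymmetric int4 uses codes [0, 15] (function names are ours):

```c
#include <assert.h>
#include <math.h>

/* Step of symmetric int4 over n values: absmax / 7. */
static float sym_int4_step(const float *x, int n) {
    assert(n > 0);
    float m = 0.0f;
    for (int i = 0; i < n; i++) {
        float a = fabsf(x[i]);
        if (a > m) m = a;
    }
    return m / 7.0f;
}

/* Step of asymmetric int4 over n values: (max - min) / 15. */
static float asym_int4_step(const float *x, int n) {
    assert(n > 0);
    float lo = x[0], hi = x[0];
    for (int i = 1; i < n; i++) {
        if (x[i] < lo) lo = x[i];
        if (x[i] > hi) hi = x[i];
    }
    return (hi - lo) / 15.0f;
}
```

For non-negative data such as {0, 1, 2, 3.5} the asymmetric step is 7/15 of the symmetric one, so the worst-case rounding error is roughly halved. For zero-centred data such as {-3.5, -1, 1, 3.5} the two steps are nearly equal and the extra parameter buys almost nothing.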
q4_0 is symmetric (no zero point, just scale). q4_1 is asymmetric (scale + min). q4_1 is more accurate at the cost of 1 more fp16 per block (~0.5 extra bpw). For modern LLM weights (approximately zero-mean and roughly Gaussian), q4_0's symmetric assumption is a good fit and the simpler format usually wins.
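The 0.5 bpw difference falls out of the block sizes. The structs below are an illustrative sketch of "one or two fp16 parameters plus 32 packed 4-bit codes"; they are not llama.cpp's actual definitions, and the field names are assumptions:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BLOCK_WEIGHTS 32

/* q4_0-style block: one fp16 scale, then 32 codes packed two per byte. */
typedef struct {
    uint16_t scale_fp16;                /* raw fp16 bits */
    uint8_t  codes[BLOCK_WEIGHTS / 2];
} BlockSym4;

/* q4_1-style block: the same, plus one fp16 min. */
typedef struct {
    uint16_t scale_fp16;
    uint16_t min_fp16;
    uint8_t  codes[BLOCK_WEIGHTS / 2];
} BlockAsym4;

/* Bits per weight for a block of BLOCK_WEIGHTS weights. */
static double bits_per_weight(size_t block_bytes) {
    return 8.0 * (double)block_bytes / BLOCK_WEIGHTS;
}
```

The symmetric block is 18 bytes (4.5 bpw) and the asymmetric block is 20 bytes (5.0 bpw).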
The outlier problem (LLM.int8)
Quantization sounds simple. What broke when people first tried it on real LLMs?
Dettmers et al. (2022), in “LLM.int8()”, observed that as transformer models scale past ~6.7B parameters, a small number of “outlier features” emerge in the activations: specific feature dimensions with magnitudes 5-100× larger than the rest. These outliers concentrate in only a few (~6-12) feature dimensions out of d_model = 4096+, but they carry a disproportionate share of the information.
The kernel below demonstrates the outlier problem on synthetic weights. It runs three schemes on the same input (with a small number of injected outliers) and reports the dequantization RMSE and the relative error of the (W · x) dot product:
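            /* ... tail of the preceding int8 routine; the start of the listing is not shown ... */
            if (q > 127) q = 127;
            if (q < -128) q = -128;
            w_dq[i] = q * scale;
        }
    }
}

/* ---- per-block symmetric int4 (q4_0-style symmetric) ------------------- */
static void per_block_int4(const float* w, float* w_dq, int n, int blk) {
    for (int b = 0; b < n; b += blk) {
        int end = b + blk; if (end > n) end = n;
        float absmax = 0;
        for (int i = b; i < end; i++) {
            float a = fabsf(w[i]);
            if (a > absmax) absmax = a;
        }
        /* int4 signed: range [-8, 7]. We use -8 as the negative extreme
           (slightly asymmetric; same as q4_0's choice). A block whose
           extreme value is positive has it clipped to +7. */
        float scale = absmax / 8.0f;
        if (scale == 0) scale = 1.0f;
        for (int i = b; i < end; i++) {
            /* Loop body reconstructed to mirror the int8 routine above:
               round to the nearest code, clamp to [-8, 7], dequantize. */
            int q = (int)roundf(w[i] / scale);
            if (q > 7) q = 7;
            if (q < -8) q = -8;
            w_dq[i] = q * scale;
        }
    }
}

Output: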
Weight tensor: N = 8192, std = ~0.02, max |w| = 0.335 (after outliers)

scheme                         bits/wt   dequant RMSE   (W·x) rel error
fp32 (baseline)                32.0      0.000e+00      0.000e+00
per-tensor int8                8.0       7.533e-04      3.688e-02
per-block int8 (B=32)          8.5       1.282e-04      2.675e-03
per-block int4 (B=32) ≈ q4_0   4.5       2.285e-03      4.768e-02
Three things to read off:
- Per-tensor int8 has a 3.7% dot-product error. The largest outlier sets the scale to ~0.335 / 127 ≈ 2.6e-3, but a typical weight has magnitude ~0.02, which is only ~8 integer steps from zero. Most of the 256 levels go unused. Bad.
- Per-block int8 is 14× more accurate on the dot product (about 6× on RMSE) at almost the same bit count (8.5 vs 8). Each outlier's contamination is confined to its own 32-weight block; every other block gets a scale that fits its own weights.
- Per-block int4 is comparable to per-tensor int8 on dot-product error (4.8% vs 3.7%) despite using roughly HALF the bits per weight (4.5 vs 8), although its RMSE is about 3× higher. The per-block scale is doing more work than the extra 4 bits would have.
This is why nearly all production low-bit weight quantization is per-block (or per-group). The format that wins isn’t “more bits”; it’s “smaller blocks.”
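The per-tensor routine does not appear in the listing above. A minimal sketch of what such a baseline could look like, written to the same conventions (clamp to [-128, 127], dequantize in place of storing codes); this is an assumption about the shape of the routine, not the original code. The helper returns mean squared error, whose square root is the RMSE reported in the table.

```c
#include <assert.h>
#include <math.h>

/* Round to the nearest integer, halves away from zero. */
static int round_nearest(float v) {
    return (int)(v >= 0.0f ? v + 0.5f : v - 0.5f);
}

/* Per-tensor symmetric int8 round trip: one scale shared by all n weights. */
static void per_tensor_int8(const float *w, float *w_dq, int n) {
    float absmax = 0.0f;
    for (int i = 0; i < n; i++) {
        float a = fabsf(w[i]);
        if (a > absmax) absmax = a;
    }
    float scale = absmax / 127.0f;
    if (scale == 0.0f) scale = 1.0f;
    for (int i = 0; i < n; i++) {
        int q = round_nearest(w[i] / scale);
        if (q > 127) q = 127;
        if (q < -128) q = -128;
        w_dq[i] = (float)q * scale;
    }
}

/* Mean squared round-trip error over n weights. */
static double mean_sq_error(const float *w, const float *w_dq, int n) {
    assert(n > 0);
    double sum = 0.0;
    for (int i = 0; i < n; i++) {
        double d = (double)w[i] - (double)w_dq[i];
        sum += d * d;
    }
    return sum / n;
}
```

Appending a single 0.335 outlier to a handful of ~0.02-magnitude weights raises the mean squared error of the small weights by well over an order of magnitude, which is the effect the table shows at scale.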
The setup: int8 has 256 quantization levels per weight; int4 has 16 levels per weight. Naively int8 should be 16× more accurate.
What kills per-tensor quantization: the scale = absmax / 127 is set by the LARGEST weight in the tensor. If there’s a single outlier 10× bigger than the typical weight (outliers of this order are the empirical reality for LLMs past ~6.7B params), the typical weight maps to integer values in about [-12, 12], using ~5 bits of effective resolution despite having 8 bits available. The outlier “wastes” most of the integer range.
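The arithmetic behind that range can be written down directly. A sketch, assuming the scale is set by a single outlier `ratio` times larger than a typical weight and counting only the integer codes that lie inside the typical range (both function names are ours):

```c
#include <assert.h>

/* Code magnitude reached by a typical weight when the scale is set by an
   outlier `ratio` times larger: qmax / ratio. */
static double typical_code(double qmax, double ratio) {
    assert(qmax > 0.0 && ratio >= 1.0);
    return qmax / ratio;
}

/* Number of integer codes inside [-typical, +typical]: 2 * floor(t) + 1. */
static int levels_used(double qmax, double ratio) {
    int t = (int)typical_code(qmax, ratio);  /* truncation is floor for t >= 0 */
    return 2 * t + 1;
}
```

With qmax = 127 and ratio = 10, typical weights reach code 12.7, so 25 of the 255 symmetric codes are in play, which is a little under 5 bits. With no outlier (ratio = 1) all 255 are used.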
What blockwise fixes: each block of 32 weights computes its own scale from its own absmax. If only 0.1% of weights are outliers and they are scattered independently, about 97% of blocks (0.999^32 ≈ 0.969) contain no outlier, so they have a “normal” scale and use the full integer range. The few outlier-containing blocks pay the contamination cost, but it’s isolated.
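How many blocks stay clean depends on both the outlier rate and the block size. A small sketch, assuming each weight is an outlier independently with probability p (the function name is illustrative):

```c
#include <assert.h>

/* Probability that a block of k weights contains no outlier when each weight
   is independently an outlier with probability p: (1 - p)^k. */
static double clean_block_fraction(double p, int k) {
    assert(p >= 0.0 && p <= 1.0 && k >= 0);
    double f = 1.0;
    for (int i = 0; i < k; i++) f *= 1.0 - p;
    return f;
}
```

At p = 0.001, blocks of 32 are clean about 96.8% of the time, while a "block" of 8192 (the whole tensor in the demo) is clean far less than 0.1% of the time. That gap is the whole argument for small blocks.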
For per-block int4 at block size 32:
- Each block’s scale absorbs the dynamic range mismatch.
- The 16 quantization levels are spread across the block’s actual range, not the tensor’s worst-case range.
- Effective resolution per weight is ~3.5 bits within each block’s range, which is enough for typical LLM weight distributions.
Why “more bits per weight” without finer blocking doesn’t help: the failure mode isn’t lack of quantization levels — it’s a SCALE mismatch. Adding more bits doesn’t fix the scale; it just gives you more levels at the wrong scale. Smaller blocks change the scale per group, which is what actually matters.
The cost of blockwise: 1 fp16 scale per K weights = 16/K bpw overhead. K=32: 0.5 bpw overhead. K=256: 0.0625 bpw. Modern formats use K=32 (q4_0, q5_0) for high-quality 4/5-bit, or K=256 with hierarchical sub-block scales (q4_K etc.) for slightly lower bpw at similar quality.
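The overhead formula is small enough to check by hand. A sketch with illustrative parameter names; `fp16_params` is 1 for scale-only blocks and 2 for scale + min:

```c
#include <assert.h>

/* Effective bits per weight: integer bits plus 16 bits for each fp16
   parameter, amortised over a block of `block` weights. */
static double effective_bpw(int bits, int block, int fp16_params) {
    assert(bits > 0 && block > 0 && fp16_params >= 0);
    return (double)bits + 16.0 * (double)fp16_params / (double)block;
}
```

`effective_bpw(4, 32, 1)` gives 4.5, `effective_bpw(4, 32, 2)` gives 5.0, and `effective_bpw(4, 256, 1)` gives 4.0625. The last figure covers only a single scale per 256 weights and does not include any sub-block scales.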
The PTQ method family
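Three landmark post-training quantization methods, in chronological order: LLM.int8(), GPTQ, and AWQ. LLM.int8() is covered in the outlier discussion; GPTQ and AWQ work as follows: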
GPTQ processes weights one column at a time. For each column, it quantizes the column’s entries while simultaneously adjusting later columns to compensate for the quantization error. The adjustment uses the per-layer Hessian (computed from calibration activations) to know which directions are loss-sensitive. Result: 4-bit quantization with only ~0.5 perplexity loss on Llama 7B.
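The error-compensation step can be sketched for a single weight row. This is a deliberately simplified illustration, not GPTQ itself: it takes the inverse Hessian as a given dense matrix, rounds to a uniform unclamped grid, and updates the inverse Hessian explicitly after each entry, whereas the real algorithm works on whole matrices in blocks and uses a Cholesky factorisation. All names are hypothetical.

```c
#include <assert.h>

/* Round v to the nearest multiple of step (halves away from zero). */
static double round_to_grid(double v, double step) {
    double t = v / step;
    long q = (long)(t >= 0.0 ? t + 0.5 : t - 0.5);
    return (double)q * step;
}

/* Quantize one weight row in index order with error compensation.
     w    : in = original weights, out = quantized values on the grid
     hinv : n*n row-major inverse Hessian, overwritten during the sweep
     step : grid spacing */
static void quantize_row_compensated(double *w, double *hinv, int n, double step) {
    assert(n > 0 && step > 0.0);
    for (int i = 0; i < n; i++) {
        double d = hinv[i * n + i];
        assert(d > 0.0);
        double q = round_to_grid(w[i], step);
        double err = (w[i] - q) / d;
        w[i] = q;
        /* Shift the not-yet-quantized entries to absorb this entry's error. */
        for (int j = i + 1; j < n; j++)
            w[j] -= err * hinv[j * n + i];
        /* Remove entry i from the inverse Hessian (rank-1 update). */
        for (int r = i + 1; r < n; r++)
            for (int c = i + 1; c < n; c++)
                hinv[r * n + c] -= hinv[r * n + i] * hinv[i * n + c] / d;
    }
}
```

With an identity inverse Hessian the sweep reduces to plain round-to-nearest. With correlated inputs it can pick a different grid point for a later entry because that choice cancels part of the earlier rounding error in the layer output.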
AWQ is conceptually simpler: it scans calibration activations and identifies which weight channels are “important” (those that multiply large activations). It then multiplies those weight channels by a per-channel factor s > 1 and divides the corresponding activation channel by s at inference (a division that can usually be folded into the preceding operation). The math is unchanged, but the important weights are now larger relative to the quantization step, so they are represented with higher effective resolution. Result: comparable to GPTQ at lower compute cost.
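The scaling identity that AWQ relies on, following the paper's formulation in which salient weight channels are scaled up by s and the matching activations scaled down, can be checked in a few lines. This sketch shows only the identity; choosing s from calibration data and the quantization step itself are left out, and the function names are ours.

```c
#include <assert.h>

/* Per-channel rescaling: w_s[j] = w[j] * s[j] and x_s[j] = x[j] / s[j].
   The product w_s[j] * x_s[j] equals w[j] * x[j] for every channel. */
static void awq_scale(const float *w, const float *x, const float *s, int n,
                      float *w_s, float *x_s) {
    for (int j = 0; j < n; j++) {
        assert(s[j] > 0.0f);
        w_s[j] = w[j] * s[j];
        x_s[j] = x[j] / s[j];
    }
}

/* Plain dot product over n channels. */
static float dot(const float *a, const float *b, int n) {
    float acc = 0.0f;
    for (int j = 0; j < n; j++) acc += a[j] * b[j];
    return acc;
}
```

Because the layer output is unchanged before quantization, s is free to be chosen purely to reduce quantization error: a rounding error of fixed absolute size in a scaled-up weight is divided by s when it reaches the output.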
Why outliers break naive quantization:
Per-tensor int8 sets scale = absmax / 127. If 99% of values have magnitude ~σ and 1% have magnitude ~50σ, the scale is set to 50σ / 127. The 99% non-outlier values quantize to integers in [-2, 2] — using ~3 bits of effective resolution despite having 8 bits available. The model collapses.
Empirically (Dettmers 2022): below 6.7B params, no significant outliers. Above 6.7B: emergent outlier dimensions in specific feature axes (often layer-correlated — the same dimensions are outliers across many layers). Below 6.7B, naive int8 works; above, it doesn’t.
Response 1: LLM.int8. Keep outlier dimensions in fp16; quantize the rest to int8. Hybrid precision. Memory savings ~50%. Works but the implementation is complex (need to detect outliers at runtime; matmul uses two precisions).
Response 2: GPTQ / AWQ. Use per-channel scales for activations and/or weights. Equivalently: factor out the outlier scale into the per-channel factor, so the quantization grid spans the “normalised” range. Per-block weight quantization + per-channel activation handling closes the gap.
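The hybrid-precision idea in Response 1 can be sketched on a single dot product. This is a simplified illustration in the spirit of LLM.int8(), not its implementation: the real method operates on whole matrices with vector-wise scales, and the function name, the threshold argument, and the use of `float` for the high-precision path are all assumptions.

```c
#include <assert.h>
#include <math.h>
#include <stdint.h>

/* Round to the nearest integer, halves away from zero. */
static int round_nearest(float v) {
    return (int)(v >= 0.0f ? v + 0.5f : v - 0.5f);
}

/* Dot product with outlier dimensions kept in floating point.
   Dimensions where |x[j]| >= threshold take the high-precision path; the
   rest are quantized to int8 with absmax scales and summed in int32. */
static float mixed_precision_dot(const float *w, const float *x, int n,
                                 float threshold) {
    float w_max = 0.0f, x_max = 0.0f, outlier_part = 0.0f;
    for (int j = 0; j < n; j++) {
        if (fabsf(x[j]) >= threshold) {
            outlier_part += w[j] * x[j];
        } else {
            if (fabsf(w[j]) > w_max) w_max = fabsf(w[j]);
            if (fabsf(x[j]) > x_max) x_max = fabsf(x[j]);
        }
    }
    float w_scale = (w_max > 0.0f) ? w_max / 127.0f : 1.0f;
    float x_scale = (x_max > 0.0f) ? x_max / 127.0f : 1.0f;
    int32_t acc = 0;
    for (int j = 0; j < n; j++) {
        if (fabsf(x[j]) >= threshold) continue;  /* handled above */
        int8_t wq = (int8_t)round_nearest(w[j] / w_scale);
        int8_t xq = (int8_t)round_nearest(x[j] / x_scale);
        acc += (int32_t)wq * (int32_t)xq;
    }
    return outlier_part + (float)acc * w_scale * x_scale;
}
```

Setting the threshold very high sends every dimension through int8, which is the naive scheme. With one activation 20× larger than the rest, routing that dimension around the quantizer noticeably tightens the result, at the cost of detecting outliers at run time and running two code paths.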
Which is dominant today: AWQ-style (per-channel scale absorption + per-block quantization) is the de facto standard. It’s simpler than GPTQ, faster to compute (one pass over calibration data, no Hessian), and slightly better empirically. GPTQ is still used but is being replaced by AWQ in most stacks.
Why outliers stopped being talked about: the structural fix is now baked into every modern quantization format (per-block scales + per-channel absorption). The outlier problem is a “you have to actively design your format to ignore it” issue, not a fundamental obstacle. q4_K, q5_K, etc. all use this approach.
The deeper point: outliers were never a quantization problem — they were a granularity problem. Once formats moved from per-tensor to per-block (and per-channel for activations), outliers became localised contamination instead of model-wide collapse.
Next: §24.2 — The GGML/llama.cpp quantization family. The exact byte-level layouts of q4_0, q4_1, q5_0, q5_1, q8_0 (legacy formats), q2_K..q6_K (K-quants), the q4_K_M vs q4_K_S naming convention, and IQ-quants with imatrix calibration. The most-deployed quantization scheme in the world.