PTQ basics — int8, int4, scales, zero points

Section 24.1

A Llama 2 70B model in fp16 is 140 GB. At 4-bit weights it’s 35 GB — fits on a single H100. At 2-bit weights it’s 17.5 GB — fits on a consumer 24GB GPU with room to spare. The economics of running an LLM are dominated by this number: bits per weight. The technique that achieves it — replacing high-precision weights with low-bit integers plus a small per-block scale — is called quantization. This section covers the basic linear-map math (scale + optional zero point), the choice of per-tensor vs per-block vs per-channel granularity, the outlier problem that drove LLM.int8 → GPTQ → AWQ, and a kernel that measures the round-trip error of each scheme. The next two sections cover the GGML format family (q4_0 through q6_K, the most-deployed quantization scheme in the world) and quantization-aware training.
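The memory figures above follow from one line of arithmetic: parameters times bits per weight, divided by 8. A minimal sketch, assuming 1 GB = 1e9 bytes and ignoring activations, KV cache, and per-block scale overhead (the function name is illustrative, not from the source):

```c
/* Weight storage in gigabytes (1 GB = 1e9 bytes) for a model with
   n_params weights stored at bits_per_weight bits each.
   Ignores activations, KV cache, and any per-block scale overhead. */
double weight_gigabytes(double n_params, double bits_per_weight) {
    double bytes = n_params * bits_per_weight / 8.0;
    return bytes / 1e9;
}
```

Calling `weight_gigabytes(70e9, 16.0)`, `weight_gigabytes(70e9, 4.0)`, and `weight_gigabytes(70e9, 2.0)` reproduces the 140 GB, 35 GB, and 17.5 GB figures.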

The quantization linear map

For a single weight value x ∈ ℝ, the standard quantization map is:

Symmetric (no zero point, signed integer):
    x_q  = round(x / scale),   x_q ∈ [-2^{b-1}, 2^{b-1} - 1]
    x_dq = scale · x_q         (dequantize for use)
    scale is a single fp16 number; b is the bit width (8, 4, 5, etc.).

Asymmetric (with zero point, unsigned integer):
    x_q  = round((x - min) / scale),   x_q ∈ [0, 2^b - 1]
    x_dq = scale · x_q + min           (dequantize)
    scale and min are both fp16 numbers per block.

The symmetric form has one parameter (scale) per block; the asymmetric form has two (scale and min). Both are linear maps from a real-valued range onto a discrete integer grid. The cost: b bits per weight for the integer + the per-block fp16 overhead.
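The symmetric map can be sketched for a single value as follows. This is a minimal illustration, not the section's kernel: the function names are made up for the example, the scale is passed in rather than derived from a block, and rounding is done with a small helper so the snippet needs no math library. The clamp keeps x_q inside the stated range when x / scale overshoots.

```c
/* Round to nearest integer, ties away from zero (same convention as roundf). */
int round_nearest(float v) {
    return (int)(v >= 0.0f ? v + 0.5f : v - 0.5f);
}

/* Symmetric quantization of one value to a signed `bits`-bit integer. */
int quantize_sym(float x, float scale, int bits) {
    int qmax = (1 << (bits - 1)) - 1;   /* 127 for int8, 7 for int4 */
    int qmin = -(1 << (bits - 1));      /* -128 for int8, -8 for int4 */
    int q = round_nearest(x / scale);
    if (q > qmax) q = qmax;             /* clamp into the integer range */
    if (q < qmin) q = qmin;
    return q;
}

/* Dequantize: map the integer code back to a real value. */
float dequantize_sym(int q, float scale) {
    return scale * (float)q;
}
```

For example, with scale = 0.5 the value 1.3 quantizes to code 3 and dequantizes to 1.5, a round-trip error of 0.2, which is within half a step.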

Scale is the load-bearing parameter. It is a floating-point number (usually fp16), stored once per block of weights, that maps the integer representation back to a real value: x ≈ scale · x_q. It is chosen so the integer range just covers the block's range, and getting it wrong dominates the quantization error. For symmetric int8 with weights in x ∈ [-W_max, W_max], the scale is typically scale = W_max / 127, so the largest weight maps to ±127 (the code -128 goes unused).

Zero point is the offset that allows the integer grid to be shifted: x ≈ scale · x_q + min. Asymmetric quantization is more accurate for non-symmetric distributions (e.g., post-ReLU activations always ≥ 0) at the cost of one more fp16 per block.

Per-tensor vs per-block vs per-channel

Granularity is the single most important structural choice in PTQ:

Per-tensor: one scale (and optional min) for the entire weight matrix. Cost: ~0 bits of overhead per weight. Accuracy: bad if there is even a single outlier, because the outlier sets the scale and inflates the quantization error for every other weight.
Per-channel: one scale per output (or input) channel, i.e. per row or column of the matrix. Cost: 16 bits per d weights, a tiny fraction of a bit per weight (about 0.004 bpw at d = 4096). Accuracy: much better, since each channel's outliers stay local. Standard for INT8 GPU inference (e.g., TensorRT).
Per-block: one scale per fixed-size block of K consecutive weights (K = 32, 64, 128, 256 depending on the format). Cost: 16 bits / K weights = 0.5 bpw at K=32, 0.0625 bpw at K=256. Accuracy: excellent, since each small block gets its own dynamic range. Standard for INT4 quantization (GGML, GPTQ, AWQ).

The granularity vs overhead trade-off is the central one in low-bit quantization. Coarser granularity (per-tensor) saves overhead but suffers from outliers; finer granularity (per-block) confines each outlier's damage to its own block but adds overhead. For int4 (4 bits per weight, the regime most LLM deployments care about), the overhead of per-block scales (0.5 bpw at K=32) is small enough that per-block dominates.
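The overhead figures can be checked with a one-line formula: integer bits plus 16 bits for each fp16 parameter, amortized over the block. A small sketch, with a hypothetical helper name and the assumption that every per-block parameter is stored as fp16:

```c
/* Effective bits per weight: `bits` for the integer code, plus
   `params_per_block` fp16 parameters (16 bits each) shared by
   `block_size` weights. */
double bits_per_weight(int bits, int params_per_block, int block_size) {
    return (double)bits + 16.0 * (double)params_per_block / (double)block_size;
}
```

With one scale per 32 weights, `bits_per_weight(4, 1, 32)` gives 4.5 and `bits_per_weight(8, 1, 32)` gives 8.5. Adding a second fp16 parameter (scale + min) at the same block size gives 5.0, and a per-channel scale over 4096 weights adds under 0.004 bits.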

— think, then check —

Symmetric: x_q = round(x / scale); x_dq = scale · x_q. Signed integer; for int4 this means x_q ∈ [-8, 7].

Asymmetric: x_q = round((x - min) / scale); x_dq = scale · x_q + min. Unsigned integer; for int4 this means x_q ∈ [0, 15].

Scale: the slope of the linear map, i.e. how much real-value range each integer step covers. Set once per block. For a block with absmax W, symmetric int4 uses scale = W / 7 (the kernel later in this section divides by 8 instead, so that the negative extreme can land on -8). For a block spanning [min, max], asymmetric int4 uses scale = (max - min) / 15.

Zero point (asymmetric only): the offset, i.e. the real value that integer 0 corresponds to. Storing min is one common parameterisation (the real value at integer 0 is exactly min). For weights centered around zero, min ≈ -max, so the offset carries little extra information and asymmetric mostly just costs one more fp16 per block (0.5 bpw at K=32).

When asymmetric pays off: when the value distribution is skewed (e.g., post-ReLU activations are always ≥ 0; LayerNorm γ values cluster near 1; some weight columns have a strong sign bias). For these, asymmetric uses the full integer range; symmetric wastes half of it on the empty side of zero.
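To make the "wasted half" concrete, the sketch below round-trips one all-positive block through both int4 maps and returns the summed squared error of each. The block contents (i² / 225 for i = 0..15) are a hypothetical stand-in for skewed data, and the helper names are illustrative.

```c
/* Round to nearest integer, ties away from zero. */
int round_nearest(float v) {
    return (int)(v >= 0.0f ? v + 0.5f : v - 0.5f);
}

/* Sum of squared round-trip errors, symmetric int4: scale = absmax / 7. */
double sse_symmetric_int4(const float *x, int n) {
    float absmax = 0.0f;
    for (int i = 0; i < n; i++) {
        float a = x[i] < 0.0f ? -x[i] : x[i];
        if (a > absmax) absmax = a;
    }
    float scale = absmax / 7.0f;
    if (scale == 0.0f) scale = 1.0f;
    double sse = 0.0;
    for (int i = 0; i < n; i++) {
        int q = round_nearest(x[i] / scale);
        if (q > 7) q = 7;
        if (q < -8) q = -8;
        double d = (double)x[i] - (double)scale * q;
        sse += d * d;
    }
    return sse;
}

/* Sum of squared round-trip errors, asymmetric int4:
   scale = (max - min) / 15, offset = min. Requires n >= 1. */
double sse_asymmetric_int4(const float *x, int n) {
    float lo = x[0], hi = x[0];
    for (int i = 1; i < n; i++) {
        if (x[i] < lo) lo = x[i];
        if (x[i] > hi) hi = x[i];
    }
    float scale = (hi - lo) / 15.0f;
    if (scale == 0.0f) scale = 1.0f;
    double sse = 0.0;
    for (int i = 0; i < n; i++) {
        int q = round_nearest((x[i] - lo) / scale);
        if (q > 15) q = 15;
        if (q < 0) q = 0;
        double d = (double)x[i] - ((double)scale * q + (double)lo);
        sse += d * d;
    }
    return sse;
}

/* Hypothetical skewed block: 16 non-negative values in [0, 1]. */
void fill_skewed_block(float *x) {
    for (int i = 0; i < 16; i++) x[i] = (float)(i * i) / 225.0f;
}
```

On this block the symmetric map only ever emits codes 0 through 7, so its step is roughly twice as coarse as the asymmetric one and its squared error is several times larger.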

q4_0 is symmetric (no zero point, just a scale). q4_1 is asymmetric (scale + min). q4_1 is more accurate at the cost of 1 more fp16 per block (0.5 extra bpw at block size 32). Trained LLM weights are usually close to zero-mean and roughly bell-shaped, so q4_0’s symmetric assumption is a good fit and the simpler format is often the better trade.

The outlier problem (LLM.int8)

Quantization sounds simple. What broke when people first tried it on real LLMs?

Dettmers et al. (2022), in “LLM.int8()”, observed that as transformer models scale past ~6.7B parameters, a small number of “outlier features” emerge in the activations: specific feature dimensions with magnitudes 5-100× larger than the rest. These outliers concentrate in only a few (~6-12) feature dimensions out of d_model = 4096+, but they carry a disproportionate share of the information.

The LLM.int8 outlier observation: In models with ≥ 6.7B params, certain activation dimensions become "outlier features" whose values are 5-100× larger than the activation distribution's std. Per-tensor int8 quantization of these activations fails catastrophically: the outlier dimension's magnitude sets the scale, so the non-outlier majority is squeezed into a handful of integer levels and loses most of its 8 bits of resolution. Perplexity diverges. LLM.int8's fix: separate out the outlier dimensions, run them in fp16, and quantize the rest to int8. Hybrid precision.
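The hybrid-precision idea can be sketched on a single dot product. This is a simplified illustration, not the LLM.int8() implementation: it handles one weight row and one activation vector, uses plain float where the method uses fp16, and the function names and the threshold parameter are assumptions made for the example. Dimensions whose activation magnitude reaches the threshold take the float path; the rest are quantized to int8 with scales computed from the remaining values only.

```c
#include <stdint.h>
#include <float.h>

int round_nearest(float v) {
    return (int)(v >= 0.0f ? v + 0.5f : v - 0.5f);
}

float abs_f(float v) { return v < 0.0f ? -v : v; }

/* int8 path: dot product over the dims with |x[i]| < threshold,
   using one absmax scale for w and one for x, int32 accumulation. */
float dot_int8_below(const float *w, const float *x, int n, float threshold) {
    float wmax = 0.0f, xmax = 0.0f;
    for (int i = 0; i < n; i++) {
        if (abs_f(x[i]) >= threshold) continue;
        if (abs_f(w[i]) > wmax) wmax = abs_f(w[i]);
        if (abs_f(x[i]) > xmax) xmax = abs_f(x[i]);
    }
    if (wmax == 0.0f || xmax == 0.0f) return 0.0f;
    float sw = wmax / 127.0f, sx = xmax / 127.0f;
    int32_t acc = 0;
    for (int i = 0; i < n; i++) {
        if (abs_f(x[i]) >= threshold) continue;
        int8_t wq = (int8_t)round_nearest(w[i] / sw);
        int8_t xq = (int8_t)round_nearest(x[i] / sx);
        acc += (int32_t)wq * (int32_t)xq;
    }
    return (float)acc * sw * sx;
}

/* Float path: dot product over the outlier dims only. */
float dot_float_at_or_above(const float *w, const float *x, int n, float threshold) {
    float sum = 0.0f;
    for (int i = 0; i < n; i++)
        if (abs_f(x[i]) >= threshold) sum += w[i] * x[i];
    return sum;
}

/* Hybrid: outlier dims in float, everything else in int8. */
float dot_mixed(const float *w, const float *x, int n, float threshold) {
    return dot_float_at_or_above(w, x, n, threshold)
         + dot_int8_below(w, x, n, threshold);
}
```

Passing `FLT_MAX` as the threshold to `dot_int8_below` gives the plain per-tensor int8 result for comparison. With a single large activation in the vector, the plain path rounds most small activations to 0 or ±1, while the mixed path keeps them at full int8 resolution.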

The kernel below demonstrates the outlier problem on synthetic weights. It runs three schemes on the same input (with a small number of injected outliers) and reports the dequantization RMSE and the relative error of the (W · x) dot product:

ptq_basics.c (C, excerpt): tail of the preceding int8 routine, then per_block_int4 (q4_0-style). int8 / int4 round-trip with outliers.

            if (q >  127) q =  127;
            if (q < -128) q = -128;
            w_dq[i] = q * scale;
        }
    }
}

/* ---- per-block symmetric int4 (q4_0-style symmetric) ------------------- */
static void per_block_int4(const float* w, float* w_dq, int n, int blk) {
    for (int b = 0; b < n; b += blk) {
        int end = b + blk; if (end > n) end = n;
        float absmax = 0;
        for (int i = b; i < end; i++) {
            float a = fabsf(w[i]);
            if (a > absmax) absmax = a;
        }
        /* int4 signed: range [-8, 7]. We use -8 as the negative extreme
           (slightly asymmetric — same as q4_0's choice). */
        float scale = absmax / 8.0f;
        if (scale == 0) scale = 1.0f;
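        for (int i = b; i < end; i++) {
            /* Assumed completion of the truncated excerpt, mirroring the
               int8 loop above: round, clamp to [-8, 7], dequantize. */
            int q = (int)roundf(w[i] / scale);
            if (q >  7) q =  7;   /* +absmax maps to +8, so it clips to 7 */
            if (q < -8) q = -8;
            w_dq[i] = q * scale;
        }
    }
}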

Output:

Weight tensor: N = 8192, std = ~0.02, max |w| = 0.335 (after outliers)

scheme                           bits/wt      dequant RMSE  (W·x) rel error
fp32 (baseline)                  32.0         0.000e+00     0.000e+00
per-tensor int8                  8.0          7.533e-04     3.688e-02
per-block int8 (B=32)            8.5          1.282e-04     2.675e-03
per-block int4 (B=32) ≈ q4_0     4.5          2.285e-03     4.768e-02

Three things to read off:

Per-tensor int8 has a 3.7% dot-product error. The largest outlier sets the scale to ~0.335 / 127 ≈ 2.6e-3, but typical weights are ~0.02, only ~8 integer steps from zero. The bulk of the distribution (within ±2σ) therefore shares about 30 of the 256 available levels. Bad.
Per-block int8 is 14× more accurate on the dot product (about 6× on RMSE) at almost the same bit count (8.5 vs 8). Each outlier’s contamination is confined to its own 32-weight block; every other block gets a scale fitted to its own range.
Per-block int4 is comparable to per-tensor int8 on the dot product (4.8% vs 3.7%) despite using roughly HALF the bits per weight (4.5 vs 8), although its RMSE is about 3× higher. The per-block scale is doing more work than the extra bits would have.

This is why production low-bit formats are almost always per-block (or per-group). The format that wins isn’t “more bits”; it’s “smaller blocks.”
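A self-contained reduction of the same experiment is sketched below. It is not the section's ptq_basics.c: the data generator (a small linear congruential generator producing uniform weights in [-0.04, 0.04) plus one injected outlier of 0.335) and the function names are assumptions made for illustration, so the numbers it produces will differ from the table above. Setting blk equal to n gives the per-tensor case.

```c
int round_nearest(float v) {
    return (int)(v >= 0.0f ? v + 0.5f : v - 0.5f);
}

/* Symmetric int8 round trip with one absmax scale per `blk` weights.
   Returns the mean squared dequantization error. */
double int8_roundtrip_mse(const float *w, int n, int blk) {
    double sse = 0.0;
    for (int b = 0; b < n; b += blk) {
        int end = (b + blk < n) ? b + blk : n;
        float absmax = 0.0f;
        for (int i = b; i < end; i++) {
            float a = w[i] < 0.0f ? -w[i] : w[i];
            if (a > absmax) absmax = a;
        }
        float scale = absmax / 127.0f;
        if (scale == 0.0f) scale = 1.0f;
        for (int i = b; i < end; i++) {
            int q = round_nearest(w[i] / scale);
            if (q >  127) q =  127;
            if (q < -128) q = -128;
            double d = (double)w[i] - (double)scale * q;
            sse += d * d;
        }
    }
    return sse / (double)n;
}

/* Deterministic stand-in data: pseudo-random weights in [-0.04, 0.04)
   plus a single injected outlier in the middle of the tensor. */
void fill_weights(float *w, int n) {
    unsigned int s = 12345u;
    for (int i = 0; i < n; i++) {
        s = s * 1664525u + 1013904223u;              /* LCG step */
        float u = (float)(s >> 8) / 16777216.0f;     /* uniform in [0, 1) */
        w[i] = (u - 0.5f) * 0.08f;
    }
    w[n / 2] = 0.335f;
}
```

Comparing `int8_roundtrip_mse(w, n, n)` with `int8_roundtrip_mse(w, n, 32)` on the same buffer shows the effect described above: only the block that holds the outlier inherits the coarse scale.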

— think, then check —

The setup: int8 has 256 quantization levels per weight; int4 has 16 levels per weight. Naively int8 should be 16× more accurate.

What kills per-tensor quantization: the scale = absmax / 127 is set by the LARGEST weight in the tensor. If there’s a single outlier 10× bigger than the typical weight (the situation Dettmers et al. report for activations past ~6.7B params, and the one the kernel above injects into weights), the typical weight maps to integer values in roughly [-13, 13], using ~5 bits of effective resolution despite having 8 bits available. The outlier “wastes” most of the integer range.

What blockwise fixes: each block of 32 weights computes its own scale from its own absmax. If only 0.1% of weights are outliers and they land independently, about 97% of blocks (0.999^32 ≈ 0.968) contain no outlier, have a “normal” scale, and use the full integer range. The few outlier-containing blocks pay the contamination cost, but it’s isolated.
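How many blocks stay clean depends on both the outlier rate and the block size. Assuming each weight is an outlier independently with probability p (an idealisation; real outliers tend to cluster in particular channels), the clean fraction is (1 - p)^K. A small sketch with a hypothetical helper name:

```c
/* Fraction of blocks of `block_size` weights that contain no outlier,
   assuming each weight is an outlier independently with probability p. */
double clean_block_fraction(double p, int block_size) {
    double f = 1.0;
    for (int i = 0; i < block_size; i++) f *= (1.0 - p);
    return f;
}
```

At p = 0.001 this gives about 0.968 for K = 32 and about 0.774 for K = 256, which is one reason larger blocks need extra structure to keep outliers contained.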

For per-block int4 at block size 32:

Each block’s scale absorbs the dynamic range mismatch.
The 16 quantization levels are spread across the block’s actual range, not the tensor’s worst-case range.
Effective resolution per weight is ~3.5 bits within each block’s range, which is typically enough for the roughly bell-shaped weights of a trained model.

Why “more bits per weight” without finer blocking doesn’t help: the failure mode isn’t lack of quantization levels — it’s a SCALE mismatch. Adding more bits doesn’t fix the scale; it just gives you more levels at the wrong scale. Smaller blocks change the scale per group, which is what actually matters.

The cost of blockwise: 1 fp16 scale per K weights = 16/K bpw overhead. K=32: 0.5 bpw overhead. K=256: 0.0625 bpw. Modern formats use K=32 (q4_0, q5_0) for high-quality 4/5-bit, or K=256 super-blocks with hierarchical sub-block scales (q4_K etc.), which store the sub-block scales at reduced precision to keep the overhead low at similar quality.

The PTQ method family

LLM.int8() was the first response to the outlier problem. Two landmark post-training quantization methods followed, in chronological order:

GPTQ (Frantar et al., 2022) is a post-training quantization method that quantizes one layer at a time using a second-order (Hessian-based) approximation of the layer's output error, with activation statistics taken from a small calibration dataset. It processes weights one column at a time. For each column, it quantizes the column’s entries while simultaneously adjusting the not-yet-quantized columns to compensate for the quantization error. The adjustment uses the per-layer Hessian (computed from calibration activations) to know which directions are loss-sensitive. Result: 4-bit quantization with only ~0.5 perplexity loss on Llama-class models such as Llama 7B. It was the standard for 4-bit GPU inference before AWQ.
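The quantize-then-compensate step can be sketched for a single weight row. This is a heavily simplified illustration and not the GPTQ implementation: it assumes the inverse Hessian is already available as a dense n×n array, uses a fixed grid step instead of per-group scales, and updates the inverse with a plain elimination step after each weight. Production code works on blocks of columns with a factorized inverse for speed and numerical stability. All function names here are hypothetical.

```c
float round_to_grid(float v, float scale) {
    float t = v / scale;
    int q = (int)(t >= 0.0f ? t + 0.5f : t - 0.5f);
    return (float)q * scale;
}

/* Quantize the n weights of one row in order. After each weight is
   rounded, its error is pushed onto the weights not yet quantized,
   weighted by the inverse Hessian.
   w:    row of weights, overwritten with the quantized values.
   hinv: n*n row-major inverse Hessian, modified in place. */
void gptq_like_row(float *w, float *hinv, int n, float scale) {
    for (int i = 0; i < n; i++) {
        float q = round_to_grid(w[i], scale);
        float err = (w[i] - q) / hinv[i * n + i];
        w[i] = q;
        /* Compensate: shift the remaining weights to absorb the error. */
        for (int j = i + 1; j < n; j++)
            w[j] -= err * hinv[i * n + j];
        /* Drop coordinate i from the inverse Hessian (one elimination step). */
        for (int j = i + 1; j < n; j++) {
            float f = hinv[j * n + i] / hinv[i * n + i];
            for (int k = i + 1; k < n; k++)
                hinv[j * n + k] -= f * hinv[i * n + k];
        }
    }
}

/* Layer-output error proxy: d^T H d with d = w_orig - w_q. */
float output_error(const float *w_orig, const float *w_q, const float *h, int n) {
    float e = 0.0f;
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            e += (w_orig[i] - w_q[i]) * h[i * n + j] * (w_orig[j] - w_q[j]);
    return e;
}
```

A two-weight case shows the effect. With H = [[1, 0.9], [0.9, 1]] (strongly correlated inputs), weights (0.4, 0.4), and an integer grid, plain rounding gives (0, 0). The compensated version rounds the first weight to 0, shifts the second to 0.76, and rounds it to 1, which yields a much smaller d^T H d because the two errors now partly cancel through the correlated inputs.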

AWQ (Lin et al., 2023) is conceptually simpler and cheaper than GPTQ (no Hessian, just per-channel statistics). It scans calibration activations and identifies which weight channels are “important” (those that multiply large activations). It then multiplies those weight channels by a per-channel factor s > 1 before quantization and divides the corresponding activation channel by s at inference. The math is unchanged, since (w · s)(x / s) = w · x, but the important weights now occupy more of the quantization grid, so their relative rounding error shrinks. Result: quality comparable to or better than GPTQ at lower compute cost, which has made it a common default for 4-bit GPU inference.
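The scale-absorption trick can be sketched on one small group. This is an illustration under stated assumptions, not the AWQ algorithm itself: the per-channel factors s are hand-picked here, whereas the method derives them from calibration activations, and the function names and the caller-provided scratch buffer are hypothetical. The weights are multiplied by s before a symmetric int4 round trip and the activations are divided by s afterwards, so the exact product is unchanged.

```c
int round_nearest(float v) {
    return (int)(v >= 0.0f ? v + 0.5f : v - 0.5f);
}

/* Symmetric int4 round trip of one group in place: scale = absmax / 7. */
void int4_roundtrip(float *w, int n) {
    float absmax = 0.0f;
    for (int i = 0; i < n; i++) {
        float a = w[i] < 0.0f ? -w[i] : w[i];
        if (a > absmax) absmax = a;
    }
    float scale = absmax / 7.0f;
    if (scale == 0.0f) scale = 1.0f;
    for (int i = 0; i < n; i++) {
        int q = round_nearest(w[i] / scale);
        if (q > 7) q = 7;
        if (q < -8) q = -8;
        w[i] = (float)q * scale;
    }
}

/* Dot product after quantizing (w[i] * s[i]) and using (x[i] / s[i]).
   With every s[i] = 1 this is plain round-to-nearest int4.
   tmp is caller-provided scratch space of length n. */
float dot_scaled_int4(const float *w, const float *x, const float *s,
                      int n, float *tmp) {
    for (int i = 0; i < n; i++) tmp[i] = w[i] * s[i];
    int4_roundtrip(tmp, n);
    float sum = 0.0f;
    for (int i = 0; i < n; i++) sum += tmp[i] * (x[i] / s[i]);
    return sum;
}
```

Take w = (0.7, 0.14) and x = (1, 10), so the exact product is 2.1 and the second channel is the important one. Plain int4 rounds 0.14 to 0.1 and returns 1.7. With s = (1, 2) the scaled weight 0.28 rounds to 0.3, the activation becomes 5, and the result is 2.2. The group's absmax, and therefore its grid step, is unchanged because the scaled weight is still below 0.7.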

— think, then check —

Why outliers break naive quantization:

Per-tensor int8 sets scale = absmax / 127. If 99% of values have magnitude ~σ and 1% have magnitude ~50σ, the scale is set to 50σ / 127. The 99% non-outlier values quantize to integers in about [-3, 3] (σ / (50σ / 127) ≈ 2.5), using ~3 bits of effective resolution despite having 8 bits available. The model collapses.

Empirically (Dettmers et al., 2022): below ~6.7B params, outliers are rare and unsystematic, and naive int8 works. Above ~6.7B, outlier dimensions emerge in specific feature axes (often layer-correlated: the same dimensions are outliers across many layers), and naive int8 no longer works.

Response 1: LLM.int8. Keep outlier dimensions in fp16; quantize the rest to int8. Hybrid precision. Memory savings ~50%. Works but the implementation is complex (need to detect outliers at runtime; matmul uses two precisions).

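Response 2: GPTQ / AWQ. Both quantize weights only and leave activations in fp16, so activation outliers never hit an integer grid. What remains is protecting the weights that matter most: GPTQ compensates each column's rounding error using the Hessian, and AWQ factors the activation magnitude into a per-channel scale so the quantization grid spans the “normalised” range. Per-block weight quantization + per-channel activation handling closes the gap.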

Which is dominant today: AWQ-style (per-channel scale absorption + per-block quantization) is a common default. It’s simpler than GPTQ, faster to compute (one pass over calibration data, no Hessian), and often reported as slightly better empirically. GPTQ is still widely used.

Why outliers stopped being talked about: the structural fix is now baked into modern quantization formats (per-block scales, plus per-channel absorption in AWQ-style pipelines). The outlier problem is something a format has to be designed around, not a fundamental obstacle. q4_K, q5_K, etc. all use per-block scales.

The deeper point: outliers were never a quantization problem — they were a granularity problem. Once formats moved from per-tensor to per-block (and per-channel for activations), outliers became localised contamination instead of model-wide collapse.

Next: §24.2 — The GGML/llama.cpp quantization family. The exact byte-level layouts of q4_0, q4_1, q5_0, q5_1, q8_0 (legacy formats), q2_K..q6_K (K-quants), the q4_K_M vs q4_K_S naming convention, and IQ-quants with imatrix calibration. The most-deployed quantization scheme in the world.