QUANTIZATION IN PRACTICE
Section 24.2

The GGML quantization family — q4_0 to q6_K, and the q4_K_M naming

Every GGUF file on Hugging Face — every Llama, Mistral, Qwen, Gemma, DeepSeek model people actually run on laptops — uses one of about a dozen GGML quantization formats. Their names look cryptic (q4_0, q4_K_M, IQ3_XXS) but they encode a precise tradeoff: bits per weight × accuracy × inference speed. This section walks through the entire family with exact byte-level struct layouts, explains the “K” (super-block) and the “M / S / L” (mixed precision per tensor) naming convention, and covers the imatrix calibration that powers the IQ-quants. By the end you’ll know exactly what file size a q4_K_M Llama 3.3 70B will produce, why q4_K_S is smaller, and what trade-off you’re making.

The legacy formats — q4_0, q4_1, q5_0, q5_1, q8_0

The original llama.cpp formats use a fixed 32-weight block size with a single fp16 scale (and optionally a min). These are the struct layouts, taken from ggml-common.h:

block_q4_0 (4.5 bpw):

    struct {
        ggml_half d;        // 2 bytes: fp16 scale
        uint8_t   qs[16];   // 16 bytes: 32 4-bit values packed as nibbles (each nibble stores q + 8)
    };                      // total 18 bytes / 32 weights = 4.5 bpw

block_q4_1 (5.0 bpw):

    struct {
        ggml_half d;        // 2 bytes: fp16 scale
        ggml_half m;        // 2 bytes: fp16 min (zero point)
        uint8_t   qs[16];   // 16 bytes: 32 4-bit values packed as nibbles
    };                      // total 20 bytes / 32 weights = 5.0 bpw

block_q5_0 (5.5 bpw):

    struct {
        ggml_half d;        // 2 bytes: fp16 scale
        uint8_t   qh[4];    // 4 bytes: high bit for each of 32 weights
        uint8_t   qs[16];   // 16 bytes: low 4 bits each (32 values)
    };                      // total 22 bytes / 32 weights = 5.5 bpw

block_q5_1 (6.0 bpw):

    { d, m, qh[4], qs[16] } // 24 bytes / 32 weights = 6.0 bpw

block_q8_0 (8.5 bpw):

    struct {
        ggml_half d;        // 2 bytes: fp16 scale
        int8_t    qs[32];   // 32 bytes: direct int8 values
    };                      // total 34 bytes / 32 weights = 8.5 bpw

The trailing “_0” means symmetric (scale only); “_1” means asymmetric (scale + min). The leading number is the integer bit width (4, 5, 8). The 0.5 bpw overhead for q4_0 comes from one fp16 (2 bytes = 16 bits) per 32 weights = 0.5 bpw — exactly the block scale cost.
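The bits-per-weight figures follow directly from the struct sizes. A minimal sketch of that arithmetic, where `bits_per_weight` and the `*_BYTES` constants are hypothetical names introduced here for illustration and are not part of ggml:

```c
#include <stddef.h>

/* Hypothetical helper (not a ggml function): bits per weight for a block
   format that stores `block_bytes` bytes per `weights_per_block` weights. */
static double bits_per_weight(size_t block_bytes, size_t weights_per_block) {
    return 8.0 * (double)block_bytes / (double)weights_per_block;
}

/* Block sizes in bytes, summed from the struct layouts listed above. */
enum {
    Q4_0_BYTES = 2 + 16,          /* d + qs          */
    Q4_1_BYTES = 2 + 2 + 16,      /* d + m + qs      */
    Q5_0_BYTES = 2 + 4 + 16,      /* d + qh + qs     */
    Q5_1_BYTES = 2 + 2 + 4 + 16,  /* d + m + qh + qs */
    Q8_0_BYTES = 2 + 32           /* d + qs          */
};
```

Calling `bits_per_weight(Q4_0_BYTES, 32)` gives 4.5: the 4 data bits plus the 0.5-bit share of the fp16 scale.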

q4_0 is the workhorse — simple, symmetric, accurate enough for most use cases. The kernel from this section implements its exact 18-byte layout:

ggml_q40.c — quantize_block_q4_0 C · exact ggml q4_0 quantize + dequantize
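/* ggml_half and fp32_to_fp16() are assumed to be defined earlier in this file. */
#define QK4_0 32

typedef struct {
    ggml_half d;               /* 2 bytes, fp16 scale */
    uint8_t   qs[QK4_0 / 2];   /* 16 bytes, 32 nibbles */
} block_q4_0;

/* Compile-time check */
_Static_assert(sizeof(block_q4_0) == 18, "block_q4_0 must be 18 bytes");

/* q4_0 quantize: one block of 32 floats → one 18-byte block_q4_0 */
static void quantize_block_q4_0(const float* x, block_q4_0* out) {
    /* Find the element with the largest magnitude, keeping its sign. */
    float amax = 0;
    int   imax = 0;
    for (int i = 0; i < QK4_0; i++) {
        float a = fabsf(x[i]);
        if (a > amax) { amax = a; imax = i; }
    }
    /* ggml convention: d = x[imax] / -8. The value with the LARGEST
       MAGNITUDE therefore always maps to level -8 (stored nibble 0),
       whatever its sign; d simply takes the opposite sign of x[imax].
       This uses the full [-8, 7] range instead of stopping at +7. */
    float d = x[imax] / -8.0f;
    if (d == 0) d = 1.0f;      /* all-zero block: any scale works */
    float id = 1.0f / d;
    out->d = fp32_to_fp16(d);

    /* Pack: nibble k from x[k] goes in low 4 bits of qs[k];
       nibble k from x[k+16] goes in high 4 bits of qs[k]. */
    for (int k = 0; k < QK4_0 / 2; k++) {
        /* As in ggml's reference quantizer: shift [-8, 7] to [0, 15],
           round to nearest, clamp the top end to 15. */
        int q0 = (int)(x[k] * id + 8.5f);
        int q1 = (int)(x[k + QK4_0 / 2] * id + 8.5f);
        if (q0 > 15) q0 = 15;
        if (q1 > 15) q1 = 15;
        out->qs[k] = (uint8_t)(q0 | (q1 << 4));
    }
}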

Output:

format                           bytes/block     bpw          RMSE
fp32 (baseline)                  n/a             32.0         0.000e+00
q4_0  (18 B / 32 weights)        18              4.5000       2.004e-03
q4_K simplified (162 B / 256 w)  162             5.0625       1.676e-03

The q4_0 RMSE on a realistic weight distribution (zero-mean Gaussian, std 0.02, sparse 10× outliers) is ~2e-3. For weights of typical magnitude ~0.02, this is ~10% relative error per weight — but the dot products that actually matter (W · x for the matmul) accumulate to much lower relative error because the per-weight errors are nearly independent.

— think, then check —

Exact layout (18 bytes / 32 weights):

  • Bytes 0-1: ggml_half d (fp16 scale)
  • Bytes 2-17: uint8_t qs[16] (32 4-bit values, packed two per byte; each nibble stores the signed level plus 8, so 0..15 represents -8..7)

Why 4.5 bpw, not 4.0: 4.0 bpw would be the integer cost alone. The 0.5 bpw is the fp16 scale: 16 bits / 32 weights = 0.5 bpw. So total = 4 + 0.5 = 4.5.

Nibble layout (the surprising part): in ggml’s q4_0, the LOW nibble of qs[k] holds the quantized value of weight x[k], and the HIGH nibble of qs[k] holds the quantized value of weight x[k+16]. The two nibbles of a byte are not a consecutive pair (x[k], x[k+1]); the high nibble holds the value 16 positions later.

Why this layout: SIMD dequantization. To dequantize a block on AVX2 / NEON, you want to load 16 packed bytes and produce 32 int8 values in element order. With the split-half layout, a single AND with 0x0F yields weights 0-15 in order, and a single right shift by 4 (plus an AND where the shift operates on wider lanes) yields weights 16-31 in order. If the layout paired consecutive weights (x[2k], x[2k+1]), the two unpacked vectors would hold the even and the odd weights, and you’d need an extra interleave step to restore element order for the matmul. The “split half” layout avoids that step.

This is a small but important detail — the kernel speed depends on it.
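A scalar sketch of the unpacking makes the split-half layout concrete. It assumes the block scale has already been converted from fp16 to float (the conversion is left out), and `dequantize_q4_0_nibbles` is a name chosen here for illustration, not the ggml function:

```c
#include <stdint.h>

#define QK4_0 32

/* Dequantize the 16 packed bytes of one q4_0 block into 32 floats.
   `d` is the block scale, already converted from fp16 to float. */
static void dequantize_q4_0_nibbles(float d, const uint8_t qs[QK4_0 / 2],
                                    float out[QK4_0]) {
    for (int k = 0; k < QK4_0 / 2; k++) {
        int lo = (qs[k] & 0x0F) - 8;   /* weight k:      mask, remove bias  */
        int hi = (qs[k] >> 4)   - 8;   /* weight k + 16: shift, remove bias */
        out[k]             = (float)lo * d;
        out[k + QK4_0 / 2] = (float)hi * d;
    }
}
```

The loop writes the first half of the output from the low nibbles and the second half from the high nibbles, which is the same pattern a SIMD kernel produces with one mask and one shift over the whole 16-byte register.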

The K-quants — super-blocks with hierarchical scales

The legacy formats have 32-weight blocks with a single per-block scale. The K-quants (introduced in llama.cpp in 2023) group 256 weights into a super-block and quantize the sub-block scales themselves:

block_q4_K (4.5 bpw, same as q4_0, but better quality):

    struct {
        ggml_half d, dmin;     // 4 bytes: super-block scales for the 6-bit scales and the 6-bit mins
        uint8_t   scales[12];  // 12 bytes: 8 sub-block scales + 8 sub-block mins, 6 bits each
        uint8_t   qs[128];     // 128 bytes: 256 × 4-bit
    };                         // 4 + 12 + 128 = 144 bytes / 256 weights = 4.5 bpw

block_q5_K (5.5 bpw):

    { d, dmin, scales[12], qh[32], qs[128] }   // 4 + 12 + 32 + 128 = 176 bytes / 256 = 5.5 bpw

block_q6_K (6.5625 bpw):

    { ql[128], qh[64], scales[16], d }         // 128 + 64 + 16 + 2 = 210 bytes / 256 = 6.5625 bpw

block_q3_K (3.4375 bpw):

    { hmask[32], qs[64], scales[12], d }       // 32 + 64 + 12 + 2 = 110 bytes / 256 = 3.4375 bpw

block_q2_K (2.625 bpw):

    { scales[16], qs[64], d, dmin }            // 16 + 64 + 4 = 84 bytes / 256 = 2.625 bpw

K-quants beat the legacy formats at the same bpw because the sub-block scales are themselves quantized (to 4 or 6 bits) against an fp16 super-block scale. The bits saved pay for either a per-sub-block min at no extra cost (q4_K and q5_K keep 32-weight sub-blocks, 8 per super-block, but store both a scale and a min) or finer granularity (q2_K, q3_K, and q6_K use 16-weight sub-blocks, 16 per super-block, so an outlier contaminates 16 weights instead of 32). The byte layout packs sixteen 6-bit values into 12 bytes: 16 scales in q3_K, or 8 scales plus 8 mins in q4_K and q5_K, with the super-block d (and dmin) used for renormalisation.
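The 12-byte packing used by q4_K and q5_K (8 scales plus 8 mins, 6 bits each) can be sketched as a pack/unpack pair. The unpack side mirrors ggml’s `get_scale_min_k4`; the pack side follows the arrangement in ggml’s reference quantizer. The function names `pack_scales_mins_k4` and `unpack_scale_min_k4` are chosen here for illustration, and the sketch should be checked against the ggml source before being relied on:

```c
#include <stdint.h>
#include <string.h>

/* Pack 8 scales and 8 mins (each 0..63) into 12 bytes. */
static void pack_scales_mins_k4(const uint8_t sc[8], const uint8_t mn[8],
                                uint8_t out[12]) {
    memset(out, 0, 12);
    for (int j = 0; j < 8; j++) {
        uint8_t ls = sc[j] & 63, lm = mn[j] & 63;
        if (j < 4) {
            out[j]     = ls;   /* bytes 0-3: scales 0-3 in the low 6 bits */
            out[j + 4] = lm;   /* bytes 4-7: mins 0-3 in the low 6 bits   */
        } else {
            /* bytes 8-11: low 4 bits of scale j and of min j */
            out[j + 4]  = (uint8_t)((ls & 0x0F) | ((lm & 0x0F) << 4));
            /* the top 2 bits go into the spare bits of bytes 0-7 */
            out[j - 4] |= (uint8_t)((ls >> 4) << 6);
            out[j]     |= (uint8_t)((lm >> 4) << 6);
        }
    }
}

/* Recover scale j and min j (j in 0..7) from the 12 packed bytes. */
static void unpack_scale_min_k4(int j, const uint8_t q[12],
                                uint8_t *sc, uint8_t *mn) {
    if (j < 4) {
        *sc = q[j] & 63;
        *mn = q[j + 4] & 63;
    } else {
        *sc = (uint8_t)((q[j + 4] & 0x0F) | ((q[j - 4] >> 6) << 4));
        *mn = (uint8_t)((q[j + 4] >> 4)   | ((q[j] >> 6) << 4));
    }
}
```

Sixteen 6-bit values are 96 bits, which is exactly 12 bytes; the odd-looking split exists so that the first four scales and mins can be read with a single mask.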

— think, then check —

q4_0: 32-weight blocks, 1 fp16 scale per block, 4 bits per weight.

Bytes per block: 2 (scale) + 16 (qs) = 18.

Per-block scale overhead: 16 bits / 32 weights = 0.5 bpw.

q4_K: 256-weight super-block of 8 × 32-weight sub-blocks. Each sub-block has its own 6-bit scale and 6-bit min; the whole super-block has two fp16 values, d and dmin, that the 6-bit scales and mins multiply against.

Bytes per super-block: 4 (d, dmin) + 12 (8 packed 6-bit scales + 8 packed 6-bit mins = 96 bits) + 128 (qs) = 144.

Per-weight cost: 144 · 8 / 256 = 4.5 bpw. Same as q4_0.

What changes:

  • The scale granularity stays at “1 per 32 weights”, the same as q4_0. (The 16-weight sub-blocks belong to q2_K, q3_K, and q6_K.)
  • The scales themselves are 6-bit instead of fp16. Each 6-bit scale is multiplied by the fp16 super-block d to reconstruct the per-sub-block scale; each 6-bit min is multiplied by dmin.
  • Both scale AND min are stored per sub-block (q4_K is asymmetric like q4_1).

Why it’s the same bpw despite the extra min: q4_K trades scale precision (fp16 → 6-bit) for a second per-sub-block parameter. Per 32 weights, q4_0 spends 16 bits on one fp16 scale. q4_K spends 6 + 6 bits on a scale and a min, plus a 4-bit share of the 32-bit (d, dmin) pair: 16 bits again. The 6-bit values are quantized, not full-precision; the fp16 super-block d and dmin provide their global range.

Why it’s better quality: q4_K gets the asymmetric grid of q4_1 (5.0 bpw) at the size of q4_0 (4.5 bpw). With a scale and a min, the 16 levels span the sub-block’s actual [min, max] range instead of a range forced to be roughly symmetric about zero, so a sub-block whose weights are skewed, or stretched to one side by a 5σ outlier, wastes fewer levels. Empirical perplexity gap on Llama-class models: ~0.1-0.3 in q4_K’s favour, at no extra size.
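The reconstruction from these fields (super-block d and dmin, a 6-bit scale and min per 32-weight sub-block, and the 4-bit codes) can be sketched as below. It follows the formula ggml documents for q4_K (weight = a · q + b, with the min applied as a subtraction in the dequantizer). For clarity the sketch assumes the fields are already unpacked and that each 4-bit code sits in its own byte; `dequant_q4_K_subblock` is an illustrative name, not the ggml function:

```c
#include <stdint.h>

/* Reconstruct one 32-weight q4_K sub-block from already-unpacked fields.
   d, dmin : super-block fp16 values, converted to float by the caller
   sc, mn  : this sub-block's 6-bit scale and min (0..63)
   q       : the 32 4-bit codes (0..15), one per byte here for clarity */
static void dequant_q4_K_subblock(float d, float dmin, uint8_t sc, uint8_t mn,
                                  const uint8_t q[32], float out[32]) {
    const float scale  = d * (float)sc;      /* effective sub-block scale */
    const float offset = dmin * (float)mn;   /* effective sub-block min   */
    for (int i = 0; i < 32; i++) {
        out[i] = scale * (float)q[i] - offset;
    }
}
```

Code 0 reconstructs to `-offset` and code 15 to `15 * scale - offset`, so the 16 levels cover whatever interval the quantizer chose for that sub-block.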

The kernel’s “q4_K simplified” version uses fp16 sub-scales (no 6-bit packing), so it costs 162 B / 256 weights = 5.0625 bpw, and it achieves slightly lower RMSE than q4_0 in the same demonstration. Real q4_K gets back to 4.5 bpw, the same as q4_0, by packing its scales and mins into 6-bit fields.

The q4_K_M vs q4_K_S naming — mixed precision

This is the convention that confuses everyone. The “_M” and “_S” don’t modify the format itself; they’re llama.cpp model-file flavors that mix multiple K-quant formats across a model’s tensors:

Model file naming convention (LLAMA_FTYPE_MOSTLY_*):

q4_K_S (small):
  - All tensors quantized at q4_K (4.5 bpw)
  - except: attn_v and ffn_down in the first 1/8 of layers → q5_K
  - Roughly: ~4.58 bpw average

q4_K_M (medium, the recommended default):
  - Most tensors at q4_K (4.5 bpw)
  - attn_v: q6_K (6.5625 bpw) for layers where use_more_bits() is true
  - ffn_down: q6_K for the same layer subset
  - token_embd: q6_K
  - output: q6_K
  - Roughly: ~4.85 bpw average, about 5-7% larger than q4_K_S, noticeably better quality

q3_K_L (large variant of q3):
  - Base q3_K plus aggressive upgrades on attn_v, ffn_down, etc.
  - Closer to q4_K_S in total size

q5_K_M:
  - Most at q5_K, attn_v + ffn_down at q6_K
  - The "high-quality" everyday quant

The exact per-tensor rules live in llama.cpp’s quantizer and vary somewhat by version and model architecture.

The intuition: some tensors matter more than others. The output projection and embedding (which directly affect token logits) and attn_v + ffn_down (which the empirical sensitivity analysis flags as the most error-sensitive) get bumped to a higher precision. The rest stays at the base format.

q4_K_M is the default choice for serious 4-bit deployment. Llama 3.3 70B in q4_K_M is about 42-43 GB; in q4_K_S it’s 40 GB; in fp16 it’s 140 GB. The 5-7% size difference between q4_K_S and q4_K_M typically buys ~0.1-0.2 perplexity, which translates to noticeably better quality on long-form generation.
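These file sizes can be sanity-checked from the average bits per weight. A rough sketch, assuming about 70.6 billion parameters for Llama 3.3 70B (an assumption, not a figure from this section) and ignoring GGUF metadata; `approx_file_gb` is a hypothetical helper:

```c
/* Hypothetical helper: approximate file size in GB (10^9 bytes) from a
   parameter count and an average bits-per-weight figure. Ignores GGUF
   metadata and any tensors kept in higher precision than the average
   implies, so treat the result as a rough estimate only. */
static double approx_file_gb(double n_params, double avg_bpw) {
    return n_params * avg_bpw / 8.0 / 1e9;
}
```

With those assumptions, `approx_file_gb(70.6e9, 4.85)` lands near 42.8 GB and `approx_file_gb(70.6e9, 4.58)` near 40.4 GB, in line with the q4_K_M and q4_K_S sizes quoted above.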

— think, then check —

What changes structurally:

q4_K_S uses q4_K (4.5 bpw) for almost every tensor. A small set of “important” tensors in early layers (attn_v + ffn_down) are bumped to q5_K (5.5 bpw).

q4_K_M upgrades MORE tensors to higher precision:

  • attn_v → q6_K (6.5625 bpw) for a significant subset of layers (use_more_bits returns true)
  • ffn_down → q6_K for the same subset
  • token_embd → q6_K (the input embedding table)
  • output → q6_K (the output projection, often tied to token_embd)

The other tensors (attn_q, attn_k, attn_o, ffn_gate, ffn_up) stay at q4_K.
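Which layers count as the “significant subset” is decided by a small rule in llama.cpp’s quantizer. The sketch below is a C transcription of that `use_more_bits` lambda as it has appeared in the llama.cpp source; the rule may change upstream, so treat it as an assumption. `count_upgraded_layers` is a hypothetical helper added for illustration:

```c
#include <stdbool.h>

/* True for the first eighth of the layers, the last eighth, and every
   third layer in between. */
static bool use_more_bits(int i_layer, int n_layers) {
    return i_layer < n_layers / 8
        || i_layer >= 7 * n_layers / 8
        || (i_layer - n_layers / 8) % 3 == 2;
}

/* Hypothetical helper: how many of n_layers get the higher-precision type. */
static int count_upgraded_layers(int n_layers) {
    int n = 0;
    for (int i = 0; i < n_layers; i++) {
        if (use_more_bits(i, n_layers)) n++;
    }
    return n;
}
```

Under this rule about half the layers are upgraded: 16 of 32 for a 32-layer model and 40 of 80 for an 80-layer model.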

Why these specific tensors: empirical sensitivity analysis. The value projection (attn_v) determines the actual information passed to subsequent layers; the FFN’s down projection determines what gets written to the residual stream; the embeddings determine token-level representation quality. Errors in these compound. The query, key, gate, up projections are more error-tolerant — quantization noise in them often averages out through subsequent matmuls.

The 2 GB cost: a roughly 5-7% larger file and correspondingly higher RAM usage at inference. Expect token generation to be about that much slower if you’re memory-bandwidth-bound, because decode time is then roughly proportional to bytes loaded.

The quality buy: empirically, q4_K_M perplexity is 0.1-0.2 points lower than q4_K_S on Llama 7B benchmarks. For chat use, this typically manifests as fewer logical errors in long-form responses, slightly better adherence to instructions, and more accurate factual recall. For coding, a rough anecdotal estimate is ~10-15% fewer subtle bugs; that figure is an impression, not a measurement.

Worth it? For chat / interactive use: yes, the quality bump is noticeable. For batch / throughput-critical: q4_K_S wins on cost. Most production inference defaults to q4_K_M for human-facing systems.

IQ-quants and the imatrix

The newest generation of GGML formats (introduced in 2024 by Ikawrakow) are the IQ-quants: IQ1_S, IQ2_XXS, IQ2_XS, IQ3_XXS, IQ3_S, IQ4_XS, etc. They push average compression on real LLMs down into the 2-bit range, and below 2 bits for IQ1_S, while trying to preserve usable quality.

The key innovations:

  1. Codebook-based quantization. Instead of mapping each weight directly to an integer, IQ-quants map small groups of weights (typically 8 weights at a time) to a codebook entry. The codebook has 256 or 512 pre-trained 8-vector patterns. Each block stores a codebook index per group, plus a per-sub-block scale.
  2. Importance matrix (imatrix) calibration. Optional but recommended: run a small calibration corpus (~100K tokens) through the unquantized model and record per-tensor activation statistics. The quantizer uses these to weight the quantization error so that “important” weight features (those that multiply large activations) are quantized more precisely.
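The codebook lookup in step 1 can be sketched as a nearest-neighbour search over candidate patterns. This is a toy illustration only: real IQ-quants use fixed grids shipped with ggml and a more elaborate search, neither of which is reproduced here. The function name and the caller-supplied codebook are hypothetical:

```c
#include <stddef.h>
#include <float.h>

#define GROUP 8   /* weights per group, as in the description above */

/* Return the index of the codebook row whose scaled pattern d * row is
   closest (squared L2 distance) to the GROUP weights in x. */
static size_t nearest_codebook_entry(const float x[GROUP], float d,
                                     const signed char (*codebook)[GROUP],
                                     size_t n_entries) {
    size_t best = 0;
    float best_err = FLT_MAX;
    for (size_t e = 0; e < n_entries; e++) {
        float err = 0.0f;
        for (int i = 0; i < GROUP; i++) {
            float diff = x[i] - d * (float)codebook[e][i];
            err += diff * diff;
        }
        if (err < best_err) { best_err = err; best = e; }
    }
    return best;
}
```

Only the returned index is stored per group, which is how eight weights can cost 8 or 9 bits in total instead of 8 × 4.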

The imatrix file is a simple binary format: for each weight tensor, one float per feature dimension representing E[x²] over the calibration data. Generating it takes a few minutes per model on a single GPU.

Quality / size summary across the GGML family (Llama 7B baseline):

format     bpw       file size (7B)   perplexity vs fp16
fp16       16.0      13.0 GB          0.000 (baseline)
q8_0       8.5       7.0 GB           +0.001
q6_K       6.5625    5.3 GB           +0.004
q5_K_M     5.7       4.8 GB           +0.013
q4_K_M     4.85      4.1 GB           +0.043   ← recommended default
q4_K_S     4.58      3.9 GB           +0.063
q4_0       4.5       3.8 GB           +0.110
q3_K_M     3.91      3.3 GB           +0.155
IQ3_XXS    ~3.06     2.6 GB           +0.210   ← with imatrix
q2_K       2.625     2.7 GB           +0.700
IQ2_XS     ~2.43     2.3 GB           +0.450   ← with imatrix
IQ2_XXS    ~2.06     1.9 GB           +1.300   ← with imatrix; getting dicey
IQ1_S      ~1.50     1.4 GB           +4.000+  ← deep degradation

The sweet spot for production deployment is q4_K_M for chat, q5_K_M for higher quality at slightly more memory, and IQ3_XXS for very memory-constrained settings. q4_0 still exists but is mostly obsolete; q4_K_M dominates the same regime with better quality. IQ1_S exists but is more of a research curiosity (perplexity degradation is severe).

— think, then check —

What the imatrix records:

For each weight tensor W in the model, the imatrix stores a per-feature vector E[x²] — the mean squared value of the activations that multiply that weight tensor, averaged over a small calibration dataset (typically 100K-1M tokens).

Concretely for a matmul Y = X · W where X is (B, d_in) activations and W is (d_in, d_out) weights:

imatrix[i] = mean over calibration data of X[i]² (the squared activation entering input feature i).
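A minimal sketch of that accumulation, assuming the activations for one batch arrive as a row-major float array. The function names are illustrative and do not correspond to llama.cpp’s imatrix tool:

```c
#include <stddef.h>

/* Add the squared activations of one batch to per-feature running sums.
   X is row-major with n_tokens rows and d_in columns. */
static void imatrix_accumulate(const float *X, size_t n_tokens, size_t d_in,
                               double *sum_sq) {
    for (size_t t = 0; t < n_tokens; t++) {
        for (size_t i = 0; i < d_in; i++) {
            double v = X[t * d_in + i];
            sum_sq[i] += v * v;
        }
    }
}

/* Turn the running sums into E[x^2], given the total token count seen. */
static void imatrix_finalize(const double *sum_sq, size_t d_in,
                             size_t total_tokens, float *imatrix) {
    for (size_t i = 0; i < d_in; i++) {
        imatrix[i] = (float)(sum_sq[i] / (double)total_tokens);
    }
}
```

The caller zero-initialises `sum_sq`, calls `imatrix_accumulate` once per calibration batch, and calls `imatrix_finalize` at the end; the result has one entry per input feature, matching the d_in dimension of W.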

How the quantizer uses it:

The naive quantization objective is “minimise || W − W_dq ||²” — the L2 error of the weight reconstruction.

The imatrix-aware objective is “minimise || X · (W − W_dq) ||²” — the L2 error of the matmul output, which is what actually matters for model quality.

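Expanding, and dropping the cross-terms between different input features (a diagonal approximation): || X · ΔW ||² ≈ N · Σ_i (E[X_i²] · || ΔW_i ||²), where N is the number of calibration tokens and ΔW_i is row i of ΔW. The error in feature i contributes in proportion to E[X_i²], so the quantization error in features with LARGE activations matters more than in features with SMALL activations.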

The imatrix-aware quantizer (used by IQ-quants and optionally by K-quants) chooses quantization grid points and codebook entries to minimise this WEIGHTED error. Features with large E[x²] get tighter quantization (better precision); features with small E[x²] can be quantized more coarsely.
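One concrete place the weighting enters is the choice of block scale. For fixed integer codes q_i and importance weights w_i, the scale that minimises Σ w_i (x_i − d·q_i)² has a closed form, d = Σ w_i x_i q_i / Σ w_i q_i². The sketch below shows only that step, under the assumption that w_i is taken from the imatrix entry for the weight’s input feature. It is not the full ggml search, and the function names are illustrative:

```c
#include <stddef.h>

/* Importance-weighted squared error of reconstructing x[i] as d * q[i]. */
static double weighted_error(const float *x, const int *q, const float *w,
                             size_t n, double d) {
    double err = 0.0;
    for (size_t i = 0; i < n; i++) {
        double diff = (double)x[i] - d * (double)q[i];
        err += (double)w[i] * diff * diff;
    }
    return err;
}

/* Closed-form scale minimising weighted_error for fixed codes q:
   d = sum(w * x * q) / sum(w * q * q). Returns 0 if every code is zero. */
static double best_weighted_scale(const float *x, const int *q,
                                  const float *w, size_t n) {
    double num = 0.0, den = 0.0;
    for (size_t i = 0; i < n; i++) {
        num += (double)w[i] * (double)x[i] * (double)q[i];
        den += (double)w[i] * (double)q[i] * (double)q[i];
    }
    return den > 0.0 ? num / den : 0.0;
}
```

With uniform weights this reduces to an ordinary least-squares fit; with imatrix weights the scale shifts toward whatever value reconstructs the high-activation features best.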

Why this matters more at low bit widths:

At 6-bit and above, there are enough quantization levels that even uniform precision is “good enough.” At 4-bit, the gap matters but is small. At 2-3 bit, the difference between “uniform precision” and “activation-weighted precision” is the difference between a working model and a broken one.

Cost: ~5 minutes per model to generate the imatrix. A few MB of disk for the file itself. Free at inference time — the imatrix is consumed only during quantization, not during model use.

The deeper connection to AWQ: AWQ applies essentially the same idea to 4-bit weight-only quantization for GPU inference (use activation statistics to decide which weight channels to protect, via per-channel scaling). The imatrix brings the idea into the GGML quantizer. Different deployment targets (GPU vs CPU/Mac), same underlying mathematical idea: quantize so the matmul output error is minimised, not the weight error.

Next: §24.3, Quantization-aware training. So far we’ve quantized AFTER training (PTQ). Now we’ll quantize DURING training: the straight-through estimator (STE), Learned Step Size Quantization (LSQ), BitNet’s 1-bit-from-scratch approach, and QLoRA, the technique that lets you fine-tune a 65B-class model on a single 48 GB GPU by keeping the base in 4-bit and only updating 16-bit LoRA adapters.