FLOATING POINT, INTEGERS & QUANTIZATION ERROR
Section 3.2

Integers and fixed-point

Floats spend bits on dynamic range. Integers don’t have any. That sounds like a downgrade, and for general-purpose computing it would be. For ML inference, where weights and activations have already been normalized to roughly known ranges and memory bandwidth is the wall you’re hitting, it’s a feature. Compared with float32, int8 values fit four times as densely in the same SIMD register, move at four times the rate through cache, and burn a fraction of the energy per operation. The whole “quantize a neural network” industry exists because the math will tolerate the precision cost in exchange for the throughput. This section is the bridge: integer representations on their own, plus the trivial-sounding “scale × integer” trick that lets integers represent reals.

Integer types and ranges

Eight integer types (four widths, each signed and unsigned) you’ll meet in ML code, with the ranges each can hold:

type              bits  unsigned range    signed range
int8 / uint8      8     0 – 255           −128 – 127
int16 / uint16    16    0 – 65 535        −32 768 – 32 767
int32 / uint32    32    0 – ~4.3 × 10⁹    ±~2.1 × 10⁹
int64 / uint64    64    0 – ~1.8 × 10¹⁹   ±~9.2 × 10¹⁸

All signed integers use two’s complement; the asymmetry (int8 goes to −128 but only +127) is a consequence. The math just works: addition, subtraction, and the low bits of multiplication use the same circuits for signed and unsigned operands; only comparisons, division, right shifts, and the high bits of multiplication differ. Most importantly for ML, a narrow integer op is not faster per operation than a wide one: a single int8 + int8 is no faster than a single int32 + int32, but four int8 additions fit into the silicon footprint of one int32 + int32.
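The "same circuits" claim can be checked in a few lines of C. This is a minimal sketch, not library code: the helper names add_bits8, as_signed8, and demo_same_adder are made up for illustration. It adds two 8-bit patterns once and then reads the result both as unsigned and as two’s complement signed.

```c
#include <assert.h>
#include <stdint.h>

/* Add two 8-bit patterns with wraparound, ignoring signedness.
   The operands promote to int, and the cast back to uint8_t keeps
   only the low 8 bits, which is well defined in C. */
uint8_t add_bits8(uint8_t a, uint8_t b) {
    return (uint8_t)(a + b);
}

/* Read an 8-bit pattern as a two's complement signed value. */
int as_signed8(uint8_t bits) {
    return bits < 128 ? (int)bits : (int)bits - 256;
}

/* One adder, two readings of the same result bits. */
void demo_same_adder(void) {
    uint8_t sum = add_bits8(0xFB, 0x03);   /* bit pattern 0xFE */
    assert(sum == 254);                    /* unsigned: 251 + 3 = 254 */
    assert(as_signed8(0xFB) == -5);        /* signed reading of 0xFB */
    assert(as_signed8(sum) == -2);         /* signed: -5 + 3 = -2 */
}
```

The adder never needs to know which interpretation the caller has in mind; only the reading of the bits changes.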

Fixed-point — the simplest way to represent a real

You have integers. You want to represent reals. The simplest possible scheme:

real_value = scale × integer_value

  scale: chosen once
  integer_value: stored per number

Pick a scale (a float, chosen once for a whole tensor or group). Store one integer per value. To recover the real: multiply. That’s fixed-point. It’s older than IEEE-754 by a few decades — every embedded system that didn’t have an FPU did its arithmetic this way — and it’s the substrate underneath modern quantization (§3.3 adds one more parameter — the zero-point — and you have everything modern frameworks use).

Two consequences fall out:

  1. Every representable value is exactly scale apart. The integers are 1 apart, so their real-valued images are scale apart. Anywhere on the number line — near zero, far from zero, same gap.
  2. The representable range is exactly [−128 × scale, 127 × scale] for signed int8. Outside that, you have to clamp (“saturate”), which loses information silently. Choosing a scale is choosing this range.
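Both consequences can be seen directly in code. The sketch below assumes one common convention for picking the scale (map the largest magnitude you care about to +127); the function names scale_for_range, decode, and saturate_int8 are illustrative, not taken from any framework.

```c
#include <assert.h>
#include <stdint.h>

/* Pick a scale so that +max_abs maps to the int8 value +127.
   (One common convention; an assumption, not the only choice.) */
float scale_for_range(float max_abs) {
    assert(max_abs > 0.0f);
    return max_abs / 127.0f;
}

/* Real value represented by the integer q under a given scale. */
float decode(int8_t q, float scale) {
    return scale * (float)q;
}

/* Saturate an already-rounded integer into the int8 range. */
int8_t saturate_int8(int v) {
    if (v < -128) return -128;
    if (v >  127) return  127;
    return (int8_t)v;
}
```

With max_abs = 2.0 the scale comes out near 0.0157. Neighbouring codes decode to values one scale apart wherever you look, and anything that rounds past the ends of the range is pinned to −128 or 127.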

The viz below makes the contrast with float visible. Slide the int8 range — what you can represent grows wider, the gap between adjacent values grows with it, but the uniformity is preserved. Float, by contrast, packs its representable points logarithmically: half of them are between -1 and +1, the rest exponentially spread out.

[Interactive figure: SpacingCompare. Top track: float32, log-uniform spacing, dense near 0, gap proportional to |x|. Bottom track: int8 × scale, uniform spacing, the same gap everywhere. Example setting: int8 step (scale) = 0.0157, 256 representable values.]
Slide the int8 range and the viewing window. Float32 tick density follows magnitude — half the points are between −1 and +1. Int8 tick density is uniform — every step is exactly scale wide. The tradeoff is visible: integers waste no precision near zero but cannot represent anything beyond ±range.
— think, then check —

Float32 spaces values logarithmically — dense near zero, exponentially sparse far from zero. Gap between consecutive representable values is proportional to the value’s magnitude (~10⁻⁷ × |v|).

Fixed-point int8 spaces values uniformly — every step is exactly scale wide, anywhere in the representable range. There is no “denser near zero” — but there’s also no representation at all beyond ±127 × scale.
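The float side of this contrast can be measured rather than taken on faith. The sketch below assumes float is IEEE-754 binary32 (true on mainstream hardware) and that x is positive, finite, and below the largest float; float32_gap is a hypothetical helper name. It steps to the next representable float by incrementing the bit pattern.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Gap between a positive finite float and the next representable
   float above it. Assumes IEEE-754 binary32 and x < FLT_MAX. */
float float32_gap(float x) {
    assert(x > 0.0f);
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);     /* read the bit pattern */
    bits += 1;                          /* next pattern up is the next float up */
    float next;
    memcpy(&next, &bits, sizeof next);  /* turn it back into a float */
    return next - x;
}
```

float32_gap(1.0f) is 2⁻²³ (about 1.2 × 10⁻⁷), while float32_gap(1024.0f) is 1024 times larger. The int8 gap under a fixed scale is the same number at both places.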

Why integers run faster

Three reasons, in increasing depth:

  1. SIMD lane density. A 256-bit AVX2 register fits 8 float32 lanes — or 32 int8 lanes. Four times the work per instruction at the same register width. The 512-bit AVX-512 ratio is 16 vs 64. NEON gives you 4 vs 16. The ratio is fixed by physics: a lane needs space for its bits, and bits are bits.
  2. Memory bandwidth. A 1 MB cache holds 256 K floats, or 1 M int8 values. Four times the working set in the same footprint. Bandwidth from main memory likewise pushes 4× as many int8 values per second. For memory-bound kernels (and on consumer hardware, inference kernels for models above roughly 50M parameters are typically memory-bound, especially at small batch sizes), this 4× shows up directly in wall time.
  3. Energy per op. Integer arithmetic skips the exponent unit and the renormalisation step that floats need. On big GPUs and inference-specialised silicon (TPU, Apple Neural Engine, Habana Gaudi) the energy ratio is roughly 4–10× in favour of int8. For batched inference at scale that’s the difference between profitable and unprofitable.
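The lane and cache figures in the first two points are plain division, sketched below so the numbers can be rechecked for other register widths or cache sizes. The function names lanes and elements_in_cache are illustrative only.

```c
#include <assert.h>
#include <stddef.h>

/* Number of lanes of a given element width that fit in one SIMD register. */
size_t lanes(size_t register_bits, size_t element_bits) {
    assert(element_bits > 0 && register_bits % element_bits == 0);
    return register_bits / element_bits;
}

/* Number of elements of a given byte size that fit in a cache. */
size_t elements_in_cache(size_t cache_bytes, size_t element_bytes) {
    assert(element_bytes > 0);
    return cache_bytes / element_bytes;
}
```

For example, lanes(256, 32) is 8 and lanes(256, 8) is 32 (AVX2), and a 1 MB cache passed as 1 << 20 bytes holds 262 144 four-byte floats or 1 048 576 one-byte int8 values.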

The structural reason quantized inference exists. At a high level: weights and activations from a trained network already cluster in known ranges. Replacing float32 with int8-with-scale captures most of the useful information at 1/4 the bits. Throughput and bandwidth improve by roughly 4×, and energy per op by at least as much. The precision loss is usually under 1% in end-task accuracy for well-quantized models. Quantization is the deployment win, and it’s why every production inference stack (vLLM, TensorRT-LLM, MLX, ONNX Runtime) has aggressive int8/int4 paths.

The accumulator problem (revisited)

If you multiply two int8 values, the result can need up to 16 bits (127 × 127 = 16 129 and −128 × −128 = 16 384, both well outside the int8 range). If you then sum many such products in a dot product, the accumulator needs to handle the sum. For a dot product of length 1024 of int8 × int8 products, the sum can reach 1024 × 16 384 ≈ 1.7 × 10⁷, which fits in int32 (range ±2.1 × 10⁹) with room to spare, but not in int16 (range ±32 K).

This is the same argument as §1’s accumulation problem, in integer form: the accumulator needs more bits than the operands. Production int8 dot product kernels uniformly accumulate into int32 for this reason. The fused intrinsics (_mm_maddubs_epi16 widens to int16; _mm256_dpbusd_epi32 widens further to int32; NEON’s SDOT directly into int32) are designed precisely to chain operand-width multiplies into accumulator-width sums, in hardware.
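Before any intrinsics, the widening rule can be written as a scalar loop. This is a reference sketch, not a production kernel, and dot_int8 is a hypothetical name. Each product is formed in int32 and summed into an int32 accumulator; the length limit in the assert is the largest n for which even the worst case (every product equal to 16 384) cannot overflow.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar int8 dot product that accumulates in int32.
   The largest product is (-128) * (-128) = 16384 = 2^14, so the sum
   stays below 2^31 as long as n <= 131071. */
int32_t dot_int8(const int8_t *a, const int8_t *b, size_t n) {
    assert(n <= 131071);
    int32_t acc = 0;
    for (size_t i = 0; i < n; i++) {
        acc += (int32_t)a[i] * (int32_t)b[i];  /* widen before multiplying */
    }
    return acc;
}
```

Two worst-case products already sum to 32 768, one past the int16 maximum, which is why an int16 accumulator is not enough even for very short vectors. At length 1024 the worst case is 16 777 216, comfortably inside int32.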

We’ll see all of this in §3.3 when we build the kernel.

— think, then check —

256 / 32 = 8 float32 lanes.
256 / 16 = 16 float16 (or int16) lanes.
256 / 8 = 32 int8 lanes.

For a compute-bound vectorized op (e.g., FMA-style), int8 throughput is 4× float32 throughput at the register level — same instruction, 32 lanes vs 8. Real-world speedup is usually 3–4× because not every operation has a same-shape int8 equivalent and the accumulator width (int32) costs some throughput on the accumulate step. For memory-bound kernels, the speedup tracks bandwidth, which is also ~4× because each int8 is 1/4 the bytes.

The fixed-point formula in code

Plain C, no SIMD yet — just the conversion in both directions:

#include <math.h>    /* lrintf */
#include <stdint.h>  /* int8_t */

/* Encode a real value into a scaled int8. Assumes scale > 0 and x is not NaN. */
int8_t real_to_int8(float x, float scale) {
    float r = x / scale;                    /* unscale */
    if (r <= -128.0f) return -128;          /* clamp to int8 range first, so */
    if (r >=  127.0f) return  127;          /* lrintf never sees a value it cannot hold */
    return (int8_t)lrintf(r);               /* nearest-even under the default rounding mode */
}

/* Decode an int8 back to a real. */
float int8_to_real(int8_t q, float scale) {
    return scale * (float)q;
}

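Two notes worth keeping:

  1. Rounding. lrintf rounds to the nearest integer under the default rounding mode (ties go to even), so the encoded value is at most half a step from the true one.
  2. Clamping. Values outside [−128 × scale, 127 × scale] saturate at the ends of the int8 range instead of wrapping around.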

These two lines of policy are the whole protocol of affine quantization — the §3.3 generalization just adds a zero_point to recenter the integer range when the data isn’t centered at zero.

— think, then check —

Worst-case absolute error per value: ½ × scale = ½ × (3.0/127) ≈ 0.012. (Rounding to nearest means at most half a step off.)

Float32’s relative precision is ~10⁻⁷; on a value of magnitude 3, that’s ~3 × 10⁻⁷ absolute error. So int8 quantization here is ~40,000× less precise per value than float32 would be on the same value.

Sounds catastrophic. Two facts rescue it:
(1) ML downstream operations (matmul, softmax) average over many values; quantization errors with mean zero (which good quantizers have, by design) cancel in the sum, and the accumulated error scales as √N rather than N.
(2) The downstream task usually only needs the ranking of activations to be roughly right, not their exact values — and ranking survives much coarser perturbations than absolute reconstruction does.
Net: a per-value error on the order of 1% of the value range (here at most 0.012 on a range of ±3.0, about 0.4%) typically translates to under 1% end-task accuracy loss on a well-trained model. That’s why the trade is so attractive in practice.
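The half-step bound from the start of this answer can be checked empirically. The sketch below sweeps evenly spaced values across [−max_abs, max_abs], quantizes and dequantizes each one, and records the largest round-trip error. All names here (round_nearest, quantize, dequantize, max_roundtrip_error) are illustrative. To keep the sketch dependent only on assert.h and stdint.h, rounding uses a hand-written helper that sends ties away from zero instead of lrintf; that is a simplification, and the half-step bound holds either way up to float rounding.

```c
#include <assert.h>
#include <stdint.h>

/* Round to the nearest integer, ties away from zero (a simplification). */
static int round_nearest(float r) {
    return (int)(r >= 0.0f ? r + 0.5f : r - 0.5f);
}

/* Encode: unscale, clamp to the int8 range, round. */
int8_t quantize(float x, float scale) {
    assert(scale > 0.0f);
    float r = x / scale;
    if (r <= -128.0f) return -128;
    if (r >=  127.0f) return  127;
    return (int8_t)round_nearest(r);
}

/* Decode: multiply the stored integer by the scale. */
float dequantize(int8_t q, float scale) {
    return scale * (float)q;
}

/* Largest |x - dequantize(quantize(x))| over `steps` evenly spaced
   samples in [-max_abs, max_abs], using scale = max_abs / 127. */
float max_roundtrip_error(float max_abs, int steps) {
    assert(max_abs > 0.0f && steps > 1);
    float scale = max_abs / 127.0f;
    float worst = 0.0f;
    for (int i = 0; i < steps; i++) {
        float x = -max_abs + 2.0f * max_abs * (float)i / (float)(steps - 1);
        float err = x - dequantize(quantize(x, scale), scale);
        if (err < 0.0f) err = -err;          /* absolute value */
        if (err > worst) worst = err;
    }
    return worst;
}
```

Calling max_roundtrip_error(3.0f, 10001) should return a value just under ½ × (3.0/127) ≈ 0.012, since a dense enough sweep lands close to the midpoint between two codes.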

END OF CH.3 §2 — Integers and fixed-point.
Built: SpacingCompare viz contrasting float32’s log-uniform spacing with int8 fixed-point’s uniform spacing; three recall items spanning the structural contrast, SIMD throughput math, and absolute-vs-relative error.
Coming next: §3.3 — Affine quantization and the int8 dot product kernel. The substrate of every quantized inference stack.