Integers and fixed-point

Section 3.2

Floats spend bits on dynamic range. Integers spend none. That sounds like a downgrade, and for general-purpose computing it would be. For ML inference, where weights and activations have already been normalized to roughly known ranges and memory bandwidth is the wall you’re hitting, it’s a feature. Compared with float32, int8 values fit four times as densely in the same SIMD register, move at four times the rate through cache, and burn a fraction of the energy per operation. The whole “quantize a neural network” industry exists because the math will tolerate the precision cost in exchange for the throughput. This section is the bridge: integer representations on their own, plus the trivial-sounding “scale × integer” trick that lets integers represent reals.

Integer types and ranges

Eight integer types you’ll meet in ML code (four widths, each signed and unsigned), with the ranges they can hold:

type	bits	unsigned range	signed range
`int8` / `uint8`	8	0 – 255	−128 – 127
`int16` / `uint16`	16	0 – 65 535	−32 768 – 32 767
`int32` / `uint32`	32	0 – ~4.3 × 10⁹	±~2.1 × 10⁹
`int64` / `uint64`	64	0 – ~1.8 × 10¹⁹	±~9.2 × 10¹⁸

All signed integers use two’s complement; the asymmetry (int8 goes to −128 but only +127) is a consequence. The math just works: addition, subtraction, and the low half of a multiplication use the same circuits for signed and unsigned operands; only comparisons, division, right shifts, and the high bits of multiplication differ. Most importantly for ML, a narrow integer op is no faster per lane than a wide one (int8 + int8 takes the same time as int32 + int32), but it lets you fit four additions into the silicon footprint of one int32 + int32.
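The "same circuits" claim can be checked directly. The sketch below is illustrative only (the helper names `bits_of` and `add_via_unsigned` are made up for this example): it adds two int8 values by pushing their raw bit patterns through an unsigned 8-bit add and reinterpreting the result as signed.

```c
#include <stdint.h>
#include <string.h>

/* Reinterpret the 8 bits of a signed value as unsigned, without changing them. */
static uint8_t bits_of(int8_t v) {
    uint8_t u;
    memcpy(&u, &v, sizeof u);
    return u;
}

/* Add two int8 values using only an unsigned 8-bit add (which wraps
   modulo 256), then reinterpret the resulting bits as signed. */
int8_t add_via_unsigned(int8_t a, int8_t b) {
    uint8_t sum = (uint8_t)(bits_of(a) + bits_of(b));  /* wraps mod 256 */
    int8_t out;
    memcpy(&out, &sum, sizeof out);
    return out;
}
```

For example, −3 is stored as 253 and 5 as 5; the unsigned sum 258 wraps to 2, which is also the signed answer. The adder never needs to know which interpretation the caller has in mind.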

Fixed-point — the simplest way to represent a real

You have integers. You want to represent reals. The simplest possible scheme:

real_value = scale × integer_value

scale: chosen once
integer_value: stored per number

Pick a scale (a float, chosen once for a whole tensor or group). Store one integer per value. To recover the real: multiply. That’s fixed-point. It’s older than IEEE-754 by a few decades — every embedded system that didn’t have an FPU did its arithmetic this way — and it’s the substrate underneath modern quantization (§3.3 adds one more parameter — the zero-point — and you have everything modern frameworks use).

Two consequences fall out:

Every representable value is exactly scale apart. The integers are 1 apart, so their real-valued images are scale apart. Anywhere on the number line — near zero, far from zero, same gap.
The representable range is exactly [−128 × scale, 127 × scale] for signed int8. Outside that, you have to clamp (“saturate”), which loses information silently. Choosing a scale is choosing this range.
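Both consequences can be sketched in a few lines of C. The helper names `scale_for_range` and `grid_point` are hypothetical, and mapping the largest magnitude to 127 is one common convention, assumed here rather than taken from the text.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: choose the scale so that +max_abs maps to integer 127. */
float scale_for_range(float max_abs) {
    assert(max_abs > 0.0f);
    return max_abs / 127.0f;
}

/* Real value represented by the integer q under a given scale. */
float grid_point(int8_t q, float scale) {
    return scale * (float)q;
}
```

With `max_abs = 2.0f` the scale comes out near 0.0157. `grid_point(1, s) - grid_point(0, s)` and `grid_point(101, s) - grid_point(100, s)` are the same gap (up to float rounding), and `grid_point(127, s)` is the top of the range.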

The viz below makes the contrast with float visible. Slide the int8 range — what you can represent grows wider, the gap between adjacent values grows with it, but the uniformity is preserved. Float, by contrast, packs its representable points logarithmically: half of them are between -1 and +1, the rest exponentially spread out.

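[Interactive figure: float32 vs int8 spacing. Controls: int8 range (± 2.0) and viewing window (± 2.0). Readout at these settings: int8 step (scale) = 0.0157, representable values = 256.]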

Slide the int8 range and the viewing window. Float32 tick density follows magnitude — half the points are between −1 and +1. Int8 tick density is uniform — every step is exactly scale wide. The tradeoff is visible: integers waste no precision near zero but cannot represent anything beyond ±range.

— think, then check —

Float32 spaces values logarithmically — dense near zero, exponentially sparse far from zero. Gap between consecutive representable values is proportional to the value’s magnitude (~10⁻⁷ × |v|).

Fixed-point int8 spaces values uniformly — every step is exactly scale wide, anywhere in the representable range. There is no “denser near zero” — but there’s also no representation at all beyond ±127 × scale.

↳ §3.2 spacing argument
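To put numbers on the float side of the contrast, here is a minimal sketch that measures the gap to the next float32 by incrementing the raw bit pattern. It assumes `float` is IEEE-754 binary32 and that the input is positive, finite, and below the largest float; `float32_gap` is a name invented for this example.

```c
#include <stdint.h>
#include <string.h>

/* Gap from a positive, finite float32 v to the next representable value
   above it. Adding 1 to the bit pattern of a positive float steps to the
   next float up (assumes IEEE-754 binary32 and v below the largest float). */
float float32_gap(float v) {
    uint32_t bits;
    memcpy(&bits, &v, sizeof bits);
    bits += 1;
    float next;
    memcpy(&next, &bits, sizeof next);
    return next - v;
}
```

Under those assumptions the gap at 1.0 is 2⁻²³ (about 1.2 × 10⁻⁷) and the gap at 1024.0 is 2⁻¹³, exactly 1024 times larger. The int8 gap is the scale at every position.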

Why integers run faster

Three reasons, in increasing depth:

SIMD lane density. A 256-bit AVX2 register fits 8 float32 lanes — or 32 int8 lanes. Four times the work per instruction at the same register width. The 512-bit AVX-512 ratio is 16 vs 64. NEON gives you 4 vs 16. The ratio is fixed by physics: a lane needs space for its bits, and bits are bits.
Memory bandwidth. A 1 MB cache holds 256 K floats, or 1 M int8 values. Four times the working set in the same footprint. Bandwidth from main memory likewise pushes 4× as many int8 values per second. For memory-bound kernels (and at small batch sizes, nearly every neural network inference kernel above ~50M parameters is memory-bound on consumer hardware), this 4× shows up directly in wall time.
Energy per op. Integer arithmetic skips the exponent unit and the renormalisation step that floats need. On big GPUs and inference-specialised silicon (TPU, Apple Neural Engine, Habana Gaudi) the energy ratio is roughly 4–10× in favour of int8. For batched inference at scale that’s the difference between profitable and unprofitable.

The structural reason quantized inference exists. At a high level: weights and activations from a trained network already cluster in known ranges. Replacing float32 with int8-with-scale captures most of the useful information at 1/4 the bits. Throughput and bandwidth improve by up to 4×, and energy by the 4–10× quoted above. The precision loss is usually under 1% in end-task accuracy for well-quantized models. Quantization is the deployment win, and it’s why every production inference stack (vLLM, TensorRT-LLM, MLX, ONNX Runtime) has aggressive int8/int4 paths.

The accumulator problem (revisited)

If you multiply two int8 values, the result can need up to 16 bits (127 × 127 = 16 129 and −128 × −128 = 16 384, both well outside the int8 range). If you then sum many such products in a dot product, the accumulator needs to hold the sum. For a dot product of length 1024 over int8 × int8 products, the sum can reach 1024 × 16 384 ≈ 1.7 × 10⁷, which fits in int32 (range ±2.1 × 10⁹) with room to spare, but not in int16 (range ±32 K).

This is the same argument as §1’s accumulation problem, in integer form: the accumulator needs more bits than the operands. Production int8 dot product kernels uniformly accumulate into int32 for this reason. The fused intrinsics (_mm_maddubs_epi16 widens to int16; _mm256_dpbusd_epi32 widens further to int32; NEON’s SDOT directly into int32) are designed precisely to chain operand-width multiplies into accumulator-width sums, in hardware.
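A scalar version of the same idea, as a sketch rather than a production kernel (`dot_i8` is a name chosen for this example): each int8 operand is widened before the multiply, and the products are summed in an int32 accumulator.

```c
#include <stddef.h>
#include <stdint.h>

/* Dot product of two int8 vectors with an int32 accumulator.
   Each product needs up to 16 bits, so operands are widened before the
   multiply and the running sum is kept in 32 bits. */
int32_t dot_i8(const int8_t *a, const int8_t *b, size_t n) {
    int32_t acc = 0;
    for (size_t i = 0; i < n; i++) {
        acc += (int32_t)a[i] * (int32_t)b[i];
    }
    return acc;
}
```

In the worst case (every element −128) two products already sum to 32 768, one past the int16 maximum, while length 1024 gives 16 777 216. The int32 accumulator itself is only safe up to roughly 131 thousand worst-case products, so very long reductions need splitting or a wider accumulator.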

We’ll see all of this in §3.3 when we build the kernel.

— think, then check —

256 / 32 = 8 float32 lanes.
256 / 16 = 16 float16 (or int16) lanes.
256 / 8 = 32 int8 lanes.

For a compute-bound vectorized op (e.g., FMA-style), int8 throughput is 4× float32 throughput at the register level — same instruction, 32 lanes vs 8. Real-world speedup is usually 3–4× because not every operation has a same-shape int8 equivalent and the accumulator width (int32) costs some throughput on the accumulate step. For memory-bound kernels, the speedup tracks bandwidth, which is also ~4× because each int8 is 1/4 the bytes.

↳ §3.2 SIMD density

The fixed-point formula in code

Plain C, no SIMD yet — just the conversion in both directions:

#include <math.h>    /* lrintf */
#include <stdint.h>  /* int8_t */

/* Encode a real value into a scaled int8. */
int8_t real_to_int8(float x, float scale) {
    float r = x / scale;                    /* express x in units of scale */
    if (r < -128.0f) return -128;           /* clamp to int8 range before rounding, */
    if (r >  127.0f) return  127;           /* so lrintf never sees a huge value */
    return (int8_t)lrintf(r);               /* nearest-even in the default rounding mode */
}

/* Decode an int8 back to a real. */
float int8_to_real(int8_t q, float scale) {
    return scale * (float)q;
}

Two notes worth keeping:

Round, don’t truncate. Truncating (a plain cast to int) pulls every value toward zero, so magnitudes shrink by half a step on average. lrintf rounds in the current rounding mode, which defaults to round-to-nearest-even, the IEEE-754 default, and that is unbiased.
Clamp, don’t wrap. Casting a 200.0f to int8_t in C is undefined behavior (overflow during the float-to-int cast); even if it gave you the wrapped value (-56), that’s a catastrophic error in the dot product output. Production quantizers always clamp.

These two lines of policy are the whole protocol of affine quantization — the §3.3 generalization just adds a zero_point to recenter the integer range when the data isn’t centered at zero.

— think, then check —

Worst-case absolute error per value: ½ × scale = ½ × (3.0/127) ≈ 0.012. (Rounding to nearest means at most half a step off.)

Float32’s relative precision is ~10⁻⁷; on a value of magnitude 3, that’s ~3 × 10⁻⁷ absolute error. So int8 quantization here is ~40,000× less precise per value than float32 would be on the same value.

Sounds catastrophic. Two facts rescue it:
(1) ML downstream operations (matmul, softmax) average over many values; quantization errors with mean zero (which good quantizers have, by design) cancel in the sum, and the accumulated error scales as √N rather than N.
(2) The downstream task usually only needs the ranking of activations to be roughly right, not their exact values — and ranking survives much coarser perturbations than absolute reconstruction does.
Net: a per-value error under 1% of the representable range typically translates to under 1% end-task accuracy loss on a well-trained model. That’s why the trade is so attractive in practice.

↳ §3.2 + §3.1 relative precision

END OF CH.3 §2 — Integers and fixed-point.
Built: SpacingCompare viz contrasting float32’s log-uniform spacing with int8 fixed-point’s uniform spacing; three recall items spanning the structural contrast, SIMD throughput math, and absolute-vs-relative error.
Coming next: §3.3 — Affine quantization and the int8 dot product kernel. The substrate of every quantized inference stack.