Integers and fixed-point
Floats spend bits on dynamic range. Integers spend none. That sounds like a downgrade, and for general-purpose computing it would be. For ML inference, where weights and activations have already been normalized to roughly known ranges and memory bandwidth is the wall you’re hitting, it’s a feature. Compared with float32, int8 values pack four times as densely into the same SIMD register, move through cache at four times the rate, and burn a fraction of the energy per operation. The whole “quantize a neural network” industry exists because the math tolerates the precision cost in exchange for the throughput. This section is the bridge: integer representations on their own, plus the trivial-sounding “scale × integer” trick that lets integers represent reals.
Integer types and ranges
Eight integer types you’ll meet in ML code (four widths, each signed and unsigned), with the ranges each can hold:
| type | bits | unsigned range | signed range |
|---|---|---|---|
| int8 / uint8 | 8 | 0 to 255 | −128 to 127 |
| int16 / uint16 | 16 | 0 to 65 535 | −32 768 to 32 767 |
| int32 / uint32 | 32 | 0 to ~4.3 × 10⁹ | ±~2.1 × 10⁹ |
| int64 / uint64 | 64 | 0 to ~1.8 × 10¹⁹ | ±~9.2 × 10¹⁸ |
All signed integers use two’s complement; the asymmetry (int8 goes to −128 but only +127) is a consequence. The math just works: addition, subtraction, and the low bits of multiplication use the same circuits for signed and unsigned operands; only comparisons, division, right shifts, and the high bits of multiplication differ. Most importantly for ML, a narrower integer op is not faster per operation: one int8 + int8 takes no less time than one int32 + int32. What the narrower type buys is density: four int8 additions fit into the silicon footprint of one int32 addition.
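The “same circuits” claim can be sketched in a few lines. This is an illustration, not anything the hardware literally runs: the helper names `add_u8` and `add_s8_wrapping` are made up here, and the only assumption is that `int8_t` exists on the target (in which case the C standard guarantees it is two’s complement). The signed add reinterprets the bits, runs the unsigned add, and reinterprets back.

```c
#include <stdint.h>
#include <string.h>

/* Unsigned 8-bit add; wraps modulo 256, which is well-defined in C. */
uint8_t add_u8(uint8_t a, uint8_t b) {
    return (uint8_t)(a + b);
}

/* Hypothetical helper: signed 8-bit add built on the unsigned adder.
   memcpy reinterprets the bit pattern without changing it. */
int8_t add_s8_wrapping(int8_t a, int8_t b) {
    uint8_t ua, ub, usum;
    int8_t sum;
    memcpy(&ua, &a, 1);
    memcpy(&ub, &b, 1);
    usum = add_u8(ua, ub);   /* the same adder serves both types */
    memcpy(&sum, &usum, 1);
    return sum;
}
```

With these definitions, `add_s8_wrapping(-3, 5)` yields 2, and `add_s8_wrapping(127, 1)` wraps around to −128, which is exactly the failure that saturating or widening arithmetic exists to avoid.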
Fixed-point — the simplest way to represent a real
You have integers. You want to represent reals. The simplest possible scheme:
Pick a scale (a float, chosen once for a whole tensor or group). Store one integer per value. To recover the real: multiply. That’s fixed-point. It’s older than IEEE-754 by a few decades — every embedded system that didn’t have an FPU did its arithmetic this way — and it’s the substrate underneath modern quantization (§3.3 adds one more parameter — the zero-point — and you have everything modern frameworks use).
Two consequences fall out:
- Every representable value is exactly scale apart. The integers are 1 apart, so their real-valued images are scale apart. Anywhere on the number line — near zero, far from zero, same gap.
- The representable range is exactly [−128 × scale, 127 × scale] for signed int8. Outside that, you have to clamp (“saturate”), which loses information silently. Choosing a scale is choosing this range.
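Since choosing a scale is choosing the range, here is one way the choice might be made. This sketch assumes a symmetric “absmax” policy (divide the largest magnitude in the tensor by 127); the text does not prescribe a policy, and `choose_scale_absmax`, `int8_range_min`, and `int8_range_max` are illustrative names.

```c
#include <stddef.h>

/* Assumed policy: scale = max|x| / 127, so the largest magnitude
   in the tensor maps to +127 or -127. */
float choose_scale_absmax(const float *x, size_t n) {
    float max_abs = 0.0f;
    for (size_t i = 0; i < n; i++) {
        float a = x[i] < 0.0f ? -x[i] : x[i];
        if (a > max_abs) max_abs = a;
    }
    /* Fall back to 1.0 for an all-zero tensor to avoid a zero scale. */
    return max_abs > 0.0f ? max_abs / 127.0f : 1.0f;
}

/* Representable range for signed int8 at a given scale. */
float int8_range_min(float scale) { return -128.0f * scale; }
float int8_range_max(float scale) { return  127.0f * scale; }
```

For a tensor whose largest magnitude is 3.0, this gives scale ≈ 0.0236 and a representable range of roughly [−3.02, 3.0]. Anything outside that range would be clamped.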
The viz below makes the contrast with float visible. Slide the int8 range — what you can represent grows wider, the gap between adjacent values grows with it, but the uniformity is preserved. Float, by contrast, packs its representable points logarithmically: half of them are between -1 and +1, the rest exponentially spread out.
Float32 spaces values logarithmically: dense near zero, exponentially sparse far from zero. The gap between consecutive representable values is proportional to the value’s magnitude (~10⁻⁷ × |v|).
Fixed-point int8 spaces values uniformly: every step is exactly scale wide, anywhere in the representable range. There is no “denser near zero”, but there is also no representation at all beyond ±127 × scale. The tradeoff is visible: integers waste no precision near zero but cannot represent anything beyond ±range.
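The two spacing rules can also be checked numerically. The sketch below assumes IEEE-754 binary32 floats and a positive, finite input below the largest float; under those assumptions, adding 1 to the bit pattern gives the next representable value. The function names are illustrative.

```c
#include <stdint.h>
#include <string.h>

/* Gap from v to the next float32 above it. Assumes IEEE-754 binary32
   and a positive, finite v below the largest finite float. */
float float32_gap(float v) {
    uint32_t bits;
    float next;
    memcpy(&bits, &v, sizeof bits);
    bits += 1u;                      /* next representable bit pattern */
    memcpy(&next, &bits, sizeof next);
    return next - v;
}

/* Gap between adjacent fixed-point values: always one scale step. */
float fixed_point_gap(float scale, float v) {
    (void)v;   /* position on the number line does not matter */
    return scale;
}
```

Here `float32_gap(1.0f)` is 2⁻²³ (about 1.19 × 10⁻⁷) and `float32_gap(1024.0f)` is 1024 times larger, while `fixed_point_gap` returns the same value everywhere.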
Why integers run faster
Three reasons, in increasing depth:
- SIMD lane density. A 256-bit AVX2 register fits 8 float32 lanes — or 32 int8 lanes. Four times the work per instruction at the same register width. The 512-bit AVX-512 ratio is 16 vs 64. NEON gives you 4 vs 16. The ratio is fixed by physics: a lane needs space for its bits, and bits are bits.
- Memory bandwidth. A 1 MB cache holds 256 K float32 values, or 1 M int8 values: four times the working set in the same footprint. Main-memory bandwidth likewise delivers 4× as many int8 values per second. For memory-bound kernels (and small-batch inference on models above ~50M parameters is typically memory-bound on consumer hardware), this 4× shows up directly in wall time.
- Energy per op. Integer arithmetic skips the exponent unit and the renormalisation step that floats need. On big GPUs and inference-specialised silicon (TPU, Apple Neural Engine, Habana Gaudi) the energy ratio is roughly 4–10× in favour of int8. For batched inference at scale that’s the difference between profitable and unprofitable.
The structural reason quantized inference exists: weights and activations from a trained network already cluster in known ranges, so replacing float32 with int8-with-scale captures most of the useful information at 1/4 the bits. Throughput and bandwidth improve by roughly 4×, and energy per op by at least that much. The end-task accuracy loss is usually under 1% for well-quantized models. Quantization is the deployment win, and it’s why the major production inference stacks (vLLM, TensorRT-LLM, MLX, ONNX Runtime) have aggressive int8/int4 paths.
The accumulator problem (revisited)
If you multiply two int8 values, the result can need up to 16 bits (127 × 127 = 16 129 and −128 × −128 = 16 384, both well outside the int8 range). If you then sum many such products in a dot product, the accumulator has to hold the sum. For a dot product of length 1024, the sum can reach 1024 × 16 384 = 2²⁴ ≈ 1.7 × 10⁷, which fits in int32 (range ±~2.1 × 10⁹) with room to spare, but not in int16 (range ±~32 K).
This is the same argument as §1’s accumulation problem, in integer form: the accumulator needs more bits than the operands. Production int8 dot product kernels uniformly accumulate into int32 for this reason. The fused intrinsics (_mm_maddubs_epi16 widens to int16; _mm256_dpbusd_epi32 widens further to int32; NEON’s SDOT directly into int32) are designed precisely to chain operand-width multiplies into accumulator-width sums, in hardware.
We’ll see all of this in §3.3 when we build the kernel.
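Until then, a scalar reference version makes the accumulator argument concrete. This is a minimal sketch, not the SIMD kernel: `dot_i8` and `dot_i8_worst_case` are illustrative names, and the loop does one lane at a time what the fused intrinsics do across many.

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar reference: int8 x int8 products accumulated in int32. */
int32_t dot_i8(const int8_t *a, const int8_t *b, size_t n) {
    int32_t acc = 0;
    for (size_t i = 0; i < n; i++) {
        /* Widen before multiplying so the product cannot overflow int8. */
        acc += (int32_t)a[i] * (int32_t)b[i];
    }
    return acc;
}

/* Largest |sum| a length-n dot product can produce: n * 128 * 128. */
int64_t dot_i8_worst_case(size_t n) {
    return (int64_t)n * 128 * 128;
}
```

For n = 1024 the worst case is 16 777 216, the 2²⁴ figure from the argument above. The int32 accumulator only becomes unsafe in the worst case once n reaches 2¹⁷ (131 072) elements.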
Worked out for a 256-bit (AVX2) register:
256 / 32 = 8 float32 lanes.
256 / 16 = 16 float16 (or int16) lanes.
256 / 8 = 32 int8 lanes.
For a compute-bound vectorized op (e.g., FMA-style), int8 throughput is 4× float32 throughput at the register level — same instruction, 32 lanes vs 8. Real-world speedup is usually 3–4× because not every operation has a same-shape int8 equivalent and the accumulator width (int32) costs some throughput on the accumulate step. For memory-bound kernels, the speedup tracks bandwidth, which is also ~4× because each int8 is 1/4 the bytes.
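The lane and working-set figures are plain division, which a small sketch makes easy to re-derive for other register widths or cache sizes. The function names are illustrative, and the sketch ignores real-world details such as cache-line granularity.

```c
#include <stddef.h>

/* Lanes per SIMD register: register width divided by element width. */
unsigned simd_lanes(unsigned register_bits, unsigned element_bits) {
    return register_bits / element_bits;
}

/* Elements that fit in a cache (or any buffer) of the given byte size. */
size_t elements_that_fit(size_t buffer_bytes, size_t element_bytes) {
    return buffer_bytes / element_bytes;
}
```

`simd_lanes(256, 8)` gives 32 and `simd_lanes(256, 32)` gives 8, the 4× ratio at the register level. `elements_that_fit(1 << 20, 4)` gives 262 144 floats per megabyte, against 1 048 576 int8 values.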
The fixed-point formula in code
Plain C, no SIMD yet — just the conversion in both directions:
#include <math.h>    /* lrintf */
#include <stdint.h>  /* int8_t */

/* Encode a real value into a scaled int8. */
int8_t real_to_int8(float x, float scale) {
    float r = x / scale;             /* real units -> integer steps */
    if (r <= -128.0f) return -128;   /* clamp before converting, so an   */
    if (r >= 127.0f) return 127;     /* out-of-range r never hits lrintf */
    return (int8_t)lrintf(r);        /* nearest, ties to even (default mode) */
}

/* Decode an int8 back to a real. */
float int8_to_real(int8_t q, float scale) {
    return scale * (float)q;
}
Two notes worth keeping:
- Round, don’t truncate. Truncating (a plain cast to int) pulls every value toward zero, which systematically shrinks magnitudes. lrintf rounds to nearest with ties to even (in the default IEEE-754 rounding mode), which is unbiased.
- Clamp, don’t wrap. Converting 200.0f to int8_t in C is undefined behavior (the value is out of range for the float-to-integer conversion); even if it gave you the wrapped value (−56), that would be a catastrophic error in the dot product output. Production quantizers always clamp.
These two lines of policy are the whole protocol of scale-only (symmetric) quantization; the §3.3 generalization, affine quantization, just adds a zero_point to recenter the integer range when the data isn’t centered at zero.
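The difference between wrapping and clamping is easiest to see in the integer domain. The sketch below works on a value that has already been rounded to int32, which sidesteps the undefined float-to-int8 conversion; `wrap_to_int8` and `saturate_to_int8` are illustrative names.

```c
#include <stdint.h>
#include <string.h>

/* What wrapping would give: keep the low 8 bits, reinterpret as signed. */
int8_t wrap_to_int8(int32_t v) {
    uint8_t low = (uint8_t)((uint32_t)v & 0xFFu);
    int8_t out;
    memcpy(&out, &low, 1);
    return out;
}

/* What a clamping quantizer does instead: saturate to [-128, 127]. */
int8_t saturate_to_int8(int32_t v) {
    if (v < -128) return -128;
    if (v > 127) return 127;
    return (int8_t)v;
}
```

For an input of 200, `wrap_to_int8` returns −56 (wrong sign, wrong magnitude) while `saturate_to_int8` returns 127, the nearest value the type can actually hold.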
Take a tensor quantized with scale = 3.0/127, so the int8 range tops out at 3.0. Worst-case absolute error per value: ½ × scale = ½ × (3.0/127) ≈ 0.012. (Rounding to nearest means at most half a step off.)
Float32’s relative precision is ~10⁻⁷; on a value of magnitude 3, that’s ~3 × 10⁻⁷ absolute error. So int8 quantization here is ~40,000× less precise per value than float32 would be on the same value.
Sounds catastrophic. Two facts rescue it:
(1) ML downstream operations (matmul, softmax) average over many values; quantization errors with mean zero (which good quantizers have, by design) cancel in the sum, and the accumulated error scales as √N rather than N.
(2) The downstream task usually only needs the ranking of activations to be roughly right, not their exact values — and ranking survives much coarser perturbations than absolute reconstruction does.
Net: a worst-case per-value error of about 0.4% of full scale (0.012 out of 3.0) typically translates to under 1% end-task accuracy loss on a well-trained model. That’s why the trade is so attractive in practice.
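The per-value arithmetic above can be reproduced with a short sketch. It uses the text’s own figures (scale = 3.0/127, float32 relative precision of roughly 10⁻⁷), so the float32 side is an approximation rather than an exact ulp calculation; the function names are illustrative.

```c
/* Worst-case absolute error of round-to-nearest quantization: half a step. */
double int8_worst_abs_error(double scale) {
    return 0.5 * scale;
}

/* Rough float32 absolute error at magnitude v, using the ~1e-7 relative
   precision figure from the text (an approximation, not an exact ulp). */
double float32_approx_abs_error(double v) {
    double mag = v < 0.0 ? -v : v;
    return 1e-7 * mag;
}

/* How many times coarser int8-with-scale is than float32 at magnitude v. */
double precision_ratio(double scale, double v) {
    return int8_worst_abs_error(scale) / float32_approx_abs_error(v);
}
```

With scale = 3.0/127 and v = 3.0, the worst-case int8 error comes out near 0.0118 and the ratio near 39 000, in line with the ~40,000× figure.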
Coming next: §3.3 — Affine quantization and the int8 dot product kernel. The substrate of every quantized inference stack.