THE HARDWARE SUBSTRATE
Section 21.1
01

GPU memory hierarchy + the roofline model

Throughout Parts III and IV we’ve repeated a phrase: “bandwidth-bound, not compute-bound.” It’s the central observation that drives FlashAttention, MoE inference, KV cache optimisation, and quantization. This section makes it rigorous. The “roofline model” (Williams 2009) classifies every operation by its arithmetic intensity — FLOPs per byte loaded from memory — and predicts whether the operation will be limited by peak FLOPs or by memory bandwidth. The boundary is the “ridge point”, set by hardware as peak_FLOPs / peak_bandwidth. For H100 at ~300 FLOPs/byte, almost every LLM operation EXCEPT large GEMM falls BELOW the ridge — meaning the hardware spends most of its time waiting for HBM. This is why LLM inference is bandwidth-bound; this is the structural fact that shapes the entire systems stack.

The GPU memory hierarchy

Modern GPUs have a 3-tier (sometimes 4-tier) memory hierarchy:

H100 / B200 memory tiers (latency · bandwidth · capacity): Registers: ~1 cycle, per-thread, ~256 KB total per SM L1 / SRAM: ~30 cycles, ~228 KB per SM (~228 KB × 132 SMs = ~30 MB total) also: ~5 TB/s aggregate bandwidth L2 cache: ~200 cycles, 50 MB total ~5 TB/s aggregate bandwidth HBM (HBM3): ~400-1000 cycles H100: 80 GB at 3.35 TB/s B200: 192 GB at 8 TB/s Off-package: ~5000 cycles, ~10K cycles for NVLink-to-peer The ratio HBM bandwidth / SRAM bandwidth: ~3 TB/s vs ~5 TB/s (per SM) But SRAM only holds 228 KB per SM — for a 70B model it fits NONE of the weights. Practical implication: - Anything that "lives in SRAM" is bandwidth-fast. - Anything that has to traverse HBM is bandwidth-bound by 3-8 TB/s. - Optimisation = keep operations in SRAM as long as possible.

HBM is where everything lives. SRAM is the fast scratchpad. The art of GPU kernel design is staging data through SRAM as much as possible.

Arithmetic intensity

The key metric: how many FLOPs of work does each loaded byte enable?

Arithmetic Intensity (AI) = FLOPs / bytes_loaded_from_HBM For a dot product y = sum_i a_i · b_i over N elements: FLOPs: 2N (one multiply + one add per element) Bytes: 2N · 2 = 4N (N elements of fp16 from each vector) AI: 2N / 4N = 0.5 FLOPs/byte For a matmul C = A·B with A ∈ ℝ^{m×k}, B ∈ ℝ^{k×n}, C ∈ ℝ^{m×n}: FLOPs: 2·m·n·k Bytes: m·k·2 + k·n·2 + m·n·2 ≈ 2·(m·k + k·n + m·n) for fp16 AI: 2·m·n·k / 2·(m·k + k·n + m·n) ≈ m·n·k / (m·k + k·n) (output writes are usually small) ≈ 1 / (1/n + 1/m) (when k ≫ m, n) ≈ min(m, n) (when m ≈ n) So matmul's AI ≈ min(m, n). Tiny matmul (m=n=8): AI ≈ 8. (very memory-bound) Big matmul (m=n=1024): AI ≈ 1024. (very compute-bound)

Arithmetic intensity is the operation’s intrinsic property — independent of which GPU you run on. The hardware’s ridge point determines which regime applies.

The roofline equation

Roofline model: achievable_FLOPs(op) = min( peak_FLOPs, peak_bandwidth · AI(op) ) If AI(op) > peak_FLOPs / peak_bandwidth → compute-bound (max out FLOPs) If AI(op) < peak_FLOPs / peak_bandwidth → memory-bound (max out bandwidth) ridge_AI = peak_FLOPs / peak_bandwidth For H100: peak_FLOPs: ~1 PFLOPs (bf16) = 1e15 FLOPs/s peak_BW: 3.35 TB/s = 3.35e12 B/s ridge_AI: 1e15 / 3.35e12 ≈ 300 FLOPs/byte For B200: peak_FLOPs: ~4.5 PFLOPs (bf16) = 4.5e15 peak_BW: 8 TB/s = 8e12 ridge_AI: 4.5e15 / 8e12 ≈ 560 FLOPs/byte For Apple M3 Max: peak_FLOPs: ~50 TFLOPs (fp16) = 5e13 peak_BW: 400 GB/s = 4e11 ridge_AI: 5e13 / 4e11 = 125 FLOPs/byte

The ridge point varies dramatically across hardware. A B200 needs ~2× higher arithmetic intensity than H100 to be compute-bound; a Mac Studio’s lower ridge means more ops can be compute-bound there (relatively easier to feed the smaller compute capacity).

Now make it run

The kernel computes arithmetic intensity for common LLM operations and labels each by roofline regime:

roofline.c — main loop C · arithmetic intensity for LLM ops
       AI = 2N / (2N·2) = 0.5 F/B. Memory-bound. */
    {
        int N = 4096;
        double flops = 2.0 * N;
        double bytes = 2.0 * N * 2;
        report_op("dot product (N=4096, fp16)", flops, bytes, H100_RIDGE);
    }

    /* GEMV: y = M · x. M is m × n, x is n.
       FLOPs: 2·m·n. Bytes: m·n·2 (for M, the dominant input) + n·2 (for x) ≈ 2·m·n.
       AI = 2·m·n / (2·m·n) = 1 F/B. Memory-bound. */
    {
        int m = 4096, n = 4096;
        double flops = 2.0 * m * n;
        double bytes = (double)m * n * 2 + (double)n * 2 + (double)m * 2;
        report_op("GEMV (m=n=4096, fp16)", flops, bytes, H100_RIDGE);
    }

    /* GEMM small: C = A · B. A is m × k, B is k × n. m=n=8 (decode batch), k=4096.
       FLOPs: 2·m·n·k. Bytes: m·k·2 + k·n·2 + m·n·2.
       For m=n=8: bytes dominated by A and B (similar sizes); AI is low. */
    {
        int m = 8, n = 8, k = 4096;
        double flops = 2.0 * m * n * k;
        double bytes = (double)m * k * 2 + (double)k * n * 2 + (double)m * n * 2;
        report_op("GEMM small (m=n=8, k=4096)", flops, bytes, H100_RIDGE);
    }

    /* GEMM medium: m=n=128, k=4096 (still typical batch size for inference). */
    {
        int m = 128, n = 128, k = 4096;
        double flops = 2.0 * m * n * k;
        double bytes = (double)m * k * 2 + (double)k * n * 2 + (double)m * n * 2;
        report_op("GEMM medium (m=n=128, k=4096)", flops, bytes, H100_RIDGE);
    }

    /* GEMM large: m=n=1024, k=4096 (training batch). */
    {
        int m = 1024, n = 1024, k = 4096;
        double flops = 2.0 * m * n * k;
        double bytes = (double)m * k * 2 + (double)k * n * 2 + (double)m * n * 2;
        report_op("GEMM large (m=n=1024, k=4096)", flops, bytes, H100_RIDGE);
    }

    /* Softmax over vocab (V=32K). FLOPs: ~5·V (exp + sum + div). Bytes: 2·V·2 (in, out).
       AI ≈ 5V / 4V = 1.25 F/B. Memory-bound. */
    {
        int V = 32000;
        double flops = 5.0 * V;
        double bytes = 2.0 * V * 2;
        report_op("softmax (V=32K, fp16)", flops, bytes, H100_RIDGE);
    }

    /* LayerNorm: same shape op as softmax. Memory-bound. */
    {

Output:

Roofline analysis — H100 reference
  Peak bf16 FLOPs: 1.00e+15
  Peak HBM BW:     3.35e+12 B/s
  Ridge AI:        298.5 FLOPs/byte

Common LLM operations at typical sizes:
  dot product (N=4096, fp16)      AI=   0.5 F/B  memory-bound
  GEMV (m=n=4096, fp16)           AI=   1.0 F/B  memory-bound
  GEMM small (m=n=8, k=4096)      AI=   4.0 F/B  memory-bound
  GEMM medium (m=n=128, k=4096)   AI=  63.0 F/B  memory-bound
  GEMM large (m=n=1024, k=4096)   AI= 455.1 F/B  COMPUTE-BOUND
  softmax (V=32K, fp16)           AI=   1.2 F/B  memory-bound
  LayerNorm (d=4096)              AI=   1.3 F/B  memory-bound

Read carefully: every LLM operation EXCEPT large GEMM (m, n ≥ 1024) is memory-bound on H100. Dot products, vector ops, softmax, LayerNorm, even small/medium GEMM — all bandwidth-bound, meaning the GPU’s compute units sit idle waiting for HBM.

— think, then check —

Derivation:

H100: peak compute = 1 PFLOPs (bf16) = 10^15 FLOPs/s. Peak HBM bandwidth = 3.35 TB/s = 3.35 × 10^12 B/s.

Ridge AI = peak_FLOPs / peak_bandwidth = 10^15 / 3.35×10^12 ≈ 298.5 FLOPs/byte.

This means: for the H100 to fully use its compute, the operation must do at least 300 FLOPs of work per byte loaded from HBM. Anything less and HBM bandwidth limits the throughput.

For an op with AI = 100 (below ridge):

Memory-bound. Theoretical max throughput = peak_BW × AI = 3.35 TB/s × 100 FLOPs/B = 335 TFLOPs/s.

Actual achievable: 335 TFLOPs/s, or 33.5% of peak compute. The other 66.5% of the FLOP capacity sits idle because HBM can’t feed data fast enough.

For an op with AI = 500 (above ridge):

Compute-bound. Theoretical max throughput = peak_FLOPs = 1 PFLOPs/s.

Bandwidth needed at full compute = peak_FLOPs / AI = 1e15 / 500 = 2 TB/s. HBM provides 3.35 TB/s; bandwidth is not the bottleneck.

The HBM is “over-fed” — bytes arrive faster than the compute units can use them. The compute units run at full speed.

Operational implication:

For AI = 100: 2/3 of the GPU’s potential is wasted. Strategies to fix: reduce bytes loaded (smaller dtype, fusion, blocking) or increase compute per byte (tiling, reuse, batching).

For AI = 500: the GPU is fully utilised. To go faster, you need a different (faster) GPU.

The two regimes have COMPLETELY DIFFERENT optimisation strategies. Profiling the regime first is essential before optimising.

What this means for LLM systems

Where LLM time actually goes on H100: Training (batch ~4M tokens): Most GEMMs are LARGE (m=n=4096 batch · seqlen) — compute-bound. Compute utilisation: 50-70% of peak FLOPs. Bottleneck: compute (and inter-GPU bandwidth for parameter sync). Inference batch=1 (single user, single token decode): Most GEMMs are GEMV-sized (m=n=1) — extremely memory-bound. Compute utilisation: <10% of peak FLOPs. Bottleneck: HBM bandwidth, full stop. Inference batch=8-16 (multi-user serving): GEMMs are medium-sized (m=n=8-16) — still memory-bound. Compute utilisation: 10-25% of peak. Bottleneck: HBM bandwidth. Long-context attention (T=128K): Attention's score matrix is HUGE → memory-bound. With FlashAttention: never materialised in HBM → bandwidth saved. Compute utilisation: 30-50% of peak.

The pattern is clear: LLM inference is memory-bandwidth-bound; LLM training is closer to compute-bound. This shapes the entire hardware/software trajectory:

— think, then check —

HBM (HBM3):

  • Capacity: 80 GB.
  • Bandwidth: 3.35 TB/s.
  • Used for: model weights, KV cache, activations between layers.
  • Per access: ~400-1000 cycles latency.
  • Architecturally: stacked DRAM dies physically next to the GPU die.

L2 cache:

  • Capacity: 50 MB (much smaller than HBM but global to all SMs).
  • Bandwidth: ~5 TB/s (per SM access).
  • Used for: small frequently-accessed data; caching HBM accesses.
  • Per access: ~200 cycles latency.
  • Architecturally: on the GPU die, shared across all SMs.

SRAM (per-SM L1 + shared memory):

  • Capacity: 228 KB per SM (132 SMs × 228 KB ≈ 30 MB total).
  • Bandwidth: very high per SM (~5 TB/s — wider than HBM per SM).
  • Used for: SM-local computation, CUDA shared memory, kernel scratch space.
  • Per access: ~30 cycles latency.
  • Architecturally: SRAM cells on the SM, private to that SM.

Practical use:

  • Model weights: HBM (won’t fit in SRAM).
  • Per-layer activations: HBM, sometimes streamed through L2.
  • Per-thread-block tile for matmul: SRAM (this is what FlashAttention exploits).
  • Per-thread accumulator registers: register file (fastest).

The optimisation principle: keep computation in SRAM as long as possible; minimise HBM round-trips. Every LLM kernel optimisation is essentially “stage more work in SRAM.”

— think, then check —

(a) Softmax over 32K vocab:

FLOPs per token: ~5 · 32K = 160K. Bytes per token: 32K · 2 (read) + 32K · 2 (write) = 128K. AI = 160K / 128K = 1.25.

Heavily memory-bound (far below ridge ~300).

Optimisation: minimise HBM traffic. Strategies:

  • Online softmax (Ch.12 §3): single pass instead of three; saves 2× bandwidth.
  • Fused softmax + sampling: don’t write intermediate probabilities to HBM if you only need to sample once.
  • Smaller vocab through better tokenisation: directly reduces bytes.

(b) Batched training matmul (m=n=16K, k=4096):

FLOPs: 2 · 16K · 16K · 4K = 2.2 TFLOPs. Bytes: 2 · (16K · 4K + 16K · 4K + 16K · 16K) ≈ 768 MB. AI = 2.2e12 / 7.68e8 ≈ 2860.

Far above ridge ~300. Compute-bound.

Optimisation: maximise FLOP utilisation. Strategies:

  • Tensor cores (next section): the only way to actually hit peak FLOPs.
  • Tile sizing for SM register/SRAM capacity: hide latency, maximise reuse.
  • Pipelining the load/compute/store stages.
  • cuBLAS or hand-tuned kernels (e.g., CUTLASS).

(c) Attention Q·K^T at T=8K, d=128:

Per head: FLOPs = 2 · 8K · 8K · 128 = 17 GFLOPs. Bytes = 8K · 128 · 2 · 3 (Q, K, S) = 6 MB. AI = 17e9 / 6e6 ≈ 2800.

Compute-bound at this size. Surprising? It is — but Q·K^T is a square N²·d matmul where N²·d is large enough to push above ridge.

Optimisation: same as large GEMM — tensor cores, tiling, pipelining. The whole REASON FlashAttention works is that it keeps this matmul + softmax + the second matmul ALL in SRAM, avoiding the HBM bottleneck for the intermediate matrices.

(d) Embedding lookup:

Per token: bytes = 1 row of embedding table = d · 2 = 8 KB. FLOPs = 0 (it’s just a gather).

AI = 0. Pure memory-bound.

Optimisation: this op is so cheap (kilobytes per token) that it’s essentially free; no specific optimisation needed. The cost is the L2 cache miss when looking up a “cold” embedding. Strategies:

  • Sort tokens by frequency / cluster similar ones so common embeddings stay warm.
  • Use shared embeddings (input + output tied) so the same table serves both.

This is why embedding is rarely a measured bottleneck: it’s already cheap and well-cached.

Next: §21.2 — Tensor cores, fp8, NVLink. The specific hardware features that distinguish H100/B200 from prior GPUs and shape how LLM workloads run.