GPU memory hierarchy + the roofline model
Throughout Parts III and IV we’ve repeated a phrase: “bandwidth-bound, not compute-bound.” It’s the central observation that drives FlashAttention, MoE inference, KV cache optimisation, and quantization. This section makes it rigorous. The “roofline model” (Williams 2009) classifies every operation by its arithmetic intensity — FLOPs per byte loaded from memory — and predicts whether the operation will be limited by peak FLOPs or by memory bandwidth. The boundary is the “ridge point”, set by hardware as peak_FLOPs / peak_bandwidth. For H100 at ~300 FLOPs/byte, almost every LLM operation EXCEPT large GEMM falls BELOW the ridge — meaning the hardware spends most of its time waiting for HBM. This is why LLM inference is bandwidth-bound; this is the structural fact that shapes the entire systems stack.
The GPU memory hierarchy
Modern GPUs have a 3-tier (sometimes 4-tier) memory hierarchy:
HBM is where everything lives. SRAM is the fast scratchpad. The art of GPU kernel design is staging data through SRAM as much as possible.
Arithmetic intensity
The key metric: how many FLOPs of work does each loaded byte enable?
Arithmetic intensity is the operation’s intrinsic property — independent of which GPU you run on. The hardware’s ridge point determines which regime applies.
The roofline equation
The ridge point varies dramatically across hardware. A B200 needs ~2× higher arithmetic intensity than H100 to be compute-bound; a Mac Studio’s lower ridge means more ops can be compute-bound there (relatively easier to feed the smaller compute capacity).
Now make it run
The kernel computes arithmetic intensity for common LLM operations and labels each by roofline regime:
AI = 2N / (2N·2) = 0.5 F/B. Memory-bound. */
{
int N = 4096;
double flops = 2.0 * N;
double bytes = 2.0 * N * 2;
report_op("dot product (N=4096, fp16)", flops, bytes, H100_RIDGE);
}
/* GEMV: y = M · x. M is m × n, x is n.
FLOPs: 2·m·n. Bytes: m·n·2 (for M, the dominant input) + n·2 (for x) ≈ 2·m·n.
AI = 2·m·n / (2·m·n) = 1 F/B. Memory-bound. */
{
int m = 4096, n = 4096;
double flops = 2.0 * m * n;
double bytes = (double)m * n * 2 + (double)n * 2 + (double)m * 2;
report_op("GEMV (m=n=4096, fp16)", flops, bytes, H100_RIDGE);
}
/* GEMM small: C = A · B. A is m × k, B is k × n. m=n=8 (decode batch), k=4096.
FLOPs: 2·m·n·k. Bytes: m·k·2 + k·n·2 + m·n·2.
For m=n=8: bytes dominated by A and B (similar sizes); AI is low. */
{
int m = 8, n = 8, k = 4096;
double flops = 2.0 * m * n * k;
double bytes = (double)m * k * 2 + (double)k * n * 2 + (double)m * n * 2;
report_op("GEMM small (m=n=8, k=4096)", flops, bytes, H100_RIDGE);
}
/* GEMM medium: m=n=128, k=4096 (still typical batch size for inference). */
{
int m = 128, n = 128, k = 4096;
double flops = 2.0 * m * n * k;
double bytes = (double)m * k * 2 + (double)k * n * 2 + (double)m * n * 2;
report_op("GEMM medium (m=n=128, k=4096)", flops, bytes, H100_RIDGE);
}
/* GEMM large: m=n=1024, k=4096 (training batch). */
{
int m = 1024, n = 1024, k = 4096;
double flops = 2.0 * m * n * k;
double bytes = (double)m * k * 2 + (double)k * n * 2 + (double)m * n * 2;
report_op("GEMM large (m=n=1024, k=4096)", flops, bytes, H100_RIDGE);
}
/* Softmax over vocab (V=32K). FLOPs: ~5·V (exp + sum + div). Bytes: 2·V·2 (in, out).
AI ≈ 5V / 4V = 1.25 F/B. Memory-bound. */
{
int V = 32000;
double flops = 5.0 * V;
double bytes = 2.0 * V * 2;
report_op("softmax (V=32K, fp16)", flops, bytes, H100_RIDGE);
}
/* LayerNorm: same shape op as softmax. Memory-bound. */
{Output:
Roofline analysis — H100 reference
Peak bf16 FLOPs: 1.00e+15
Peak HBM BW: 3.35e+12 B/s
Ridge AI: 298.5 FLOPs/byte
Common LLM operations at typical sizes:
dot product (N=4096, fp16) AI= 0.5 F/B memory-bound
GEMV (m=n=4096, fp16) AI= 1.0 F/B memory-bound
GEMM small (m=n=8, k=4096) AI= 4.0 F/B memory-bound
GEMM medium (m=n=128, k=4096) AI= 63.0 F/B memory-bound
GEMM large (m=n=1024, k=4096) AI= 455.1 F/B COMPUTE-BOUND
softmax (V=32K, fp16) AI= 1.2 F/B memory-bound
LayerNorm (d=4096) AI= 1.3 F/B memory-bound
Read carefully: every LLM operation EXCEPT large GEMM (m, n ≥ 1024) is memory-bound on H100. Dot products, vector ops, softmax, LayerNorm, even small/medium GEMM — all bandwidth-bound, meaning the GPU’s compute units sit idle waiting for HBM.
Derivation:
H100: peak compute = 1 PFLOPs (bf16) = 10^15 FLOPs/s. Peak HBM bandwidth = 3.35 TB/s = 3.35 × 10^12 B/s.
Ridge AI = peak_FLOPs / peak_bandwidth = 10^15 / 3.35×10^12 ≈ 298.5 FLOPs/byte.
This means: for the H100 to fully use its compute, the operation must do at least 300 FLOPs of work per byte loaded from HBM. Anything less and HBM bandwidth limits the throughput.
For an op with AI = 100 (below ridge):
Memory-bound. Theoretical max throughput = peak_BW × AI = 3.35 TB/s × 100 FLOPs/B = 335 TFLOPs/s.
Actual achievable: 335 TFLOPs/s, or 33.5% of peak compute. The other 66.5% of the FLOP capacity sits idle because HBM can’t feed data fast enough.
For an op with AI = 500 (above ridge):
Compute-bound. Theoretical max throughput = peak_FLOPs = 1 PFLOPs/s.
Bandwidth needed at full compute = peak_FLOPs / AI = 1e15 / 500 = 2 TB/s. HBM provides 3.35 TB/s; bandwidth is not the bottleneck.
The HBM is “over-fed” — bytes arrive faster than the compute units can use them. The compute units run at full speed.
Operational implication:
For AI = 100: 2/3 of the GPU’s potential is wasted. Strategies to fix: reduce bytes loaded (smaller dtype, fusion, blocking) or increase compute per byte (tiling, reuse, batching).
For AI = 500: the GPU is fully utilised. To go faster, you need a different (faster) GPU.
The two regimes have COMPLETELY DIFFERENT optimisation strategies. Profiling the regime first is essential before optimising.
What this means for LLM systems
The pattern is clear: LLM inference is memory-bandwidth-bound; LLM training is closer to compute-bound. This shapes the entire hardware/software trajectory:
- Hardware: each successive generation increases HBM bandwidth more than peak FLOPs. H100 → B200: 2.5× HBM bandwidth, 4× FLOPs. The ridge has SHIFTED in B200, but bandwidth is still increased aggressively because that’s the bottleneck.
- Quantization: smaller dtypes reduce bytes loaded per element → improve AI for memory-bound ops → real wall-clock speedup. This is why int4 inference is ~3× faster than fp16 even though FLOPs are identical.
- Batching: larger inference batches push GEMMs from GEMV-sized to GEMM-sized, moving them across the ridge from memory-bound to compute-bound.
HBM (HBM3):
- Capacity: 80 GB.
- Bandwidth: 3.35 TB/s.
- Used for: model weights, KV cache, activations between layers.
- Per access: ~400-1000 cycles latency.
- Architecturally: stacked DRAM dies physically next to the GPU die.
L2 cache:
- Capacity: 50 MB (much smaller than HBM but global to all SMs).
- Bandwidth: ~5 TB/s (per SM access).
- Used for: small frequently-accessed data; caching HBM accesses.
- Per access: ~200 cycles latency.
- Architecturally: on the GPU die, shared across all SMs.
SRAM (per-SM L1 + shared memory):
- Capacity: 228 KB per SM (132 SMs × 228 KB ≈ 30 MB total).
- Bandwidth: very high per SM (~5 TB/s — wider than HBM per SM).
- Used for: SM-local computation, CUDA shared memory, kernel scratch space.
- Per access: ~30 cycles latency.
- Architecturally: SRAM cells on the SM, private to that SM.
Practical use:
- Model weights: HBM (won’t fit in SRAM).
- Per-layer activations: HBM, sometimes streamed through L2.
- Per-thread-block tile for matmul: SRAM (this is what FlashAttention exploits).
- Per-thread accumulator registers: register file (fastest).
The optimisation principle: keep computation in SRAM as long as possible; minimise HBM round-trips. Every LLM kernel optimisation is essentially “stage more work in SRAM.”
(a) Softmax over 32K vocab:
FLOPs per token: ~5 · 32K = 160K. Bytes per token: 32K · 2 (read) + 32K · 2 (write) = 128K. AI = 160K / 128K = 1.25.
Heavily memory-bound (far below ridge ~300).
Optimisation: minimise HBM traffic. Strategies:
- Online softmax (Ch.12 §3): single pass instead of three; saves 2× bandwidth.
- Fused softmax + sampling: don’t write intermediate probabilities to HBM if you only need to sample once.
- Smaller vocab through better tokenisation: directly reduces bytes.
(b) Batched training matmul (m=n=16K, k=4096):
FLOPs: 2 · 16K · 16K · 4K = 2.2 TFLOPs. Bytes: 2 · (16K · 4K + 16K · 4K + 16K · 16K) ≈ 768 MB. AI = 2.2e12 / 7.68e8 ≈ 2860.
Far above ridge ~300. Compute-bound.
Optimisation: maximise FLOP utilisation. Strategies:
- Tensor cores (next section): the only way to actually hit peak FLOPs.
- Tile sizing for SM register/SRAM capacity: hide latency, maximise reuse.
- Pipelining the load/compute/store stages.
- cuBLAS or hand-tuned kernels (e.g., CUTLASS).
(c) Attention Q·K^T at T=8K, d=128:
Per head: FLOPs = 2 · 8K · 8K · 128 = 17 GFLOPs. Bytes = 8K · 128 · 2 · 3 (Q, K, S) = 6 MB. AI = 17e9 / 6e6 ≈ 2800.
Compute-bound at this size. Surprising? It is — but Q·K^T is a square N²·d matmul where N²·d is large enough to push above ridge.
Optimisation: same as large GEMM — tensor cores, tiling, pipelining. The whole REASON FlashAttention works is that it keeps this matmul + softmax + the second matmul ALL in SRAM, avoiding the HBM bottleneck for the intermediate matrices.
(d) Embedding lookup:
Per token: bytes = 1 row of embedding table = d · 2 = 8 KB. FLOPs = 0 (it’s just a gather).
AI = 0. Pure memory-bound.
Optimisation: this op is so cheap (kilobytes per token) that it’s essentially free; no specific optimisation needed. The cost is the L2 cache miss when looking up a “cold” embedding. Strategies:
- Sort tokens by frequency / cluster similar ones so common embeddings stay warm.
- Use shared embeddings (input + output tied) so the same table serves both.
This is why embedding is rarely a measured bottleneck: it’s already cheap and well-cached.
Next: §21.2 — Tensor cores, fp8, NVLink. The specific hardware features that distinguish H100/B200 from prior GPUs and shape how LLM workloads run.