INFERENCE AT SCALE
Section 23.1
01

KV cache + PagedAttention

The KV cache (Ch.13 §2) is the central memory cost of LLM inference. For a 70B model serving 128K context, the cache alone is ~10 GB per request. Multiply by the number of concurrent users you want to serve, and KV cache memory dominates everything else. The simple-but-wrong way to manage this — pre-allocate the max-context buffer per request — wastes 90%+ of memory on typical workloads where most requests are short. vLLM (Kwon 2023, PagedAttention) introduced the structural fix: borrow virtual memory paging from operating systems. Split the KV cache into fixed-size blocks (16 tokens each); allocate on demand; use a per-request block table to track which blocks belong to which request. The result is the largest single inference-throughput jump of 2023, and the paging design is now in every serious inference engine (vLLM, SGLang, TensorRT-LLM).

The KV cache recap

From Ch.13 §2: during autoregressive decoding, each new token computes Q, K, V; only K and V are CACHED so future tokens can attend to them. The cache for one request:

KV cache size for ONE request: = N_tokens · 2 (K and V) · N_layers · N_kv_heads · d_head · dtype_bytes For Llama 3 70B (L=80, H_kv=8, d_head=128, fp16): = N · 2 · 80 · 8 · 128 · 2 = 327680 bytes per token = ~320 KB/token At N = 4K tokens: 1.3 GB per request At N = 32K tokens: 10.5 GB per request At N = 128K: 42 GB per request The cache dominates inference memory budget vs the 140 GB model weights.

The cache is necessary — without it, autoregressive generation would be O(N²) per token instead of O(N), making generation unworkable. But the MEMORY CONSUMED is the binding constraint on serving throughput.

Static allocation — the simple-but-wrong approach

Before vLLM, every inference framework did this:

for each new request: allocate KV_cache_buffer of size MAX_CONTEXT (typically 4K-32K tokens) serve requests until they finish; then free their buffers. The problem: - Most requests are SHORT (200-500 tokens). - But each is allocated the FULL max-context buffer. - Most of the allocated memory is unused. Empirical measurement (vLLM paper): - On a typical chat workload: ~20% of allocated memory is actually USED. - 80% is "reserved" for a context that the request never reaches. - Result: GPU OOMs when serving 50 concurrent requests, even though those 50 requests TOGETHER would only need 10 GB of KV memory.

This is the classic internal fragmentation problem. The OS analog: imagine if every malloc allocated 4 KB regardless of the size requested. You’d run out of memory after a few hundred allocations even though you’re only using kilobytes of actual data.

PagedAttention — the fix

vLLM’s insight: treat KV cache memory like virtual memory. Split it into fixed-size BLOCKS; allocate blocks on demand; use a block table per request to track which blocks belong to whom.

PagedAttention data structure: Global block pool: a contiguous arena of BLOCK_SIZE × N_LAYERS × ... bytes. Allocator manages it as a free list of fixed-size blocks. Per request: a "block table" — an array of pointers to blocks in the pool. Each block stores BLOCK_SIZE tokens' worth of K and V values (e.g., BLOCK_SIZE = 16 tokens). When the request generates token N+1: If the last block is full: allocate a new block from the pool. Add it to the request's block table. Otherwise: append to the existing block. When the request finishes: free all its blocks back to the pool. Block table example (for a request with 50 generated tokens, BLOCK_SIZE = 16): [block_42, block_17, block_88, block_5] ↑ ↑ ↑ ↑ tokens tokens tokens tokens 48-49 (last block: partial) 0-15 16-31 32-47 Attention computation: For attention over the full context, the kernel reads from each block. The block table maps "logical position i" → "physical block + offset within block". The attention kernel handles the indirection — no contiguous memory required.

PagedAttention is the structural change. The implementation cost: the attention kernel becomes more complex (it has to follow the block table to read non-contiguous memory) but it’s a one-time engineering investment.

Now make it run

The kernel simulates a workload of 100 concurrent requests with realistic context-length distribution (lognormal — most short, some long) and computes memory usage for both allocation strategies:

kv_cache.c — main C · KV cache static vs paged
    if (ctx > MAX_CONTEXT) ctx = MAX_CONTEXT;
    if (ctx < 16) ctx = 16;
    (void)u;
    return (int)ctx;
}

int main(void) {
    int contexts[N_REQUESTS];
    int total_actual = 0;
    int max_seen = 0;
    int min_seen = 1<<30;

    for (int i = 0; i < N_REQUESTS; i++) {
        contexts[i] = sample_context_length();
        total_actual += contexts[i];
        if (contexts[i] > max_seen) max_seen = contexts[i];
        if (contexts[i] < min_seen) min_seen = contexts[i];
    }
    float mean_actual = (float)total_actual / N_REQUESTS;

    /* ---- STATIC ALLOCATION ---- */
    size_t static_alloc = (size_t)N_REQUESTS * MAX_CONTEXT * BYTES_PER_TOKEN;
    size_t static_used  = (size_t)total_actual * BYTES_PER_TOKEN;

    /* ---- PAGED ALLOCATION ---- */
    /* Each request uses ceil(context / block_size) blocks. */
    int total_blocks = 0;
    for (int i = 0; i < N_REQUESTS; i++) {
        int blocks = (contexts[i] + BLOCK_SIZE - 1) / BLOCK_SIZE;
        total_blocks += blocks;
    }
    size_t paged_alloc = (size_t)total_blocks * BLOCK_SIZE * BYTES_PER_TOKEN;
    /* (plus a tiny overhead for the block table — negligible) */

    printf("KV cache allocation: static vs paged\n");
    printf("Workload: %d concurrent requests, context lengths lognormal-distributed.\n", N_REQUESTS);
    printf("Context stats: min=%d  mean=%.0f  max=%d  total=%d tokens\n\n",
           min_seen, mean_actual, max_seen, total_actual);
    printf("Model parameters: %d layers × 2 (K,V) × %d KV heads × %d d_head × 2 bytes (fp16)\n",
           N_LAYERS, N_KV_HEADS, D_HEAD);
    printf("                 = %zu bytes per token of KV cache\n\n", BYTES_PER_TOKEN);

    printf("STATIC ALLOCATION (pre-vLLM): reserve MAX_CONTEXT for each request\n");
    printf("  Total allocated:    %.2f GB\n", static_alloc / 1e9);
    printf("  Total used:         %.2f GB\n", static_used  / 1e9);
    printf("  Utilisation:        %.1f%%\n",  100.0 * static_used / static_alloc);

Output:

Workload: 100 concurrent requests, context lengths lognormal-distributed.
Context stats: min=34  mean=193  max=631  total=19269 tokens

STATIC ALLOCATION (pre-vLLM): reserve MAX_CONTEXT for each request
  Total allocated:    107.37 GB
  Total used:         2.53 GB
  Utilisation:        2.4%
  Wasted:             104.85 GB

PAGED ATTENTION (vLLM): allocate in 16-token blocks
  Total blocks:       1253
  Total allocated:    2.63 GB
  Total used:         2.53 GB
  Utilisation:        96.1%
  Wasted:             0.10 GB

Memory saved by paging: 104.75 GB  (40.9x more requests per GPU)

The numbers are dramatic: 40× more concurrent requests per GPU at typical workloads. For commercial LLM serving, this directly translates to 40× lower per-token cost.

— think, then check —

Per-token cache size:

K + V = 2 vectors.

Per layer per head: K is d_head = 128 elements; V is d_head = 128 elements. Both in fp16 = 2 bytes each.

Per layer: 8 KV heads × (128 + 128) × 2 bytes = 8 × 256 × 2 = 4096 bytes per layer.

Total layers: 80.

Per token: 80 × 4096 = 327680 bytes ≈ 320 KB/token.

For 128K context:

128 × 1024 × 320 KB = 128 × 1024 × 320 × 1024 bytes = ~43 GB.

So one request at 128K context needs 43 GB of KV cache alone.

Serving N concurrent users:

Static allocation: N × 43 GB. On an 80 GB H100, you fit at most 1 such request (and even then the model itself takes ~140 GB so you need 2+ H100s anyway).

Paged allocation: N × (actual_context × 320 KB). If most requests are at 4K tokens (~1.3 GB), you can serve ~30 simultaneously on the same hardware.

This is the central memory-budget calculation that drives all LLM serving infrastructure decisions.

— think, then check —

After 50 tokens, with BLOCK_SIZE = 16:

Number of blocks needed = ceil(50/16) = 4 blocks.

Block 0 holds tokens 0-15 (full).

Block 1 holds tokens 16-31 (full).

Block 2 holds tokens 32-47 (full).

Block 3 holds tokens 48-49 (only 2 of 16 used).

The “block table” for this request is an array of 4 pointers: [block_id_0, block_id_1, block_id_2, block_id_3]. Each ID is an index into the global block pool. The same logical block (e.g., logical position 0) might be physical block #42 in the pool; the table provides the indirection.

Each block contains:

For BLOCK_SIZE=16 tokens × all 80 layers × 8 KV heads × 128 d_head × 2 (K, V) × 2 bytes:

= 16 × 80 × 8 × 128 × 2 × 2 = 5 MB per block (for Llama 70B-class).

For Llama 8B: 16 × 32 × 8 × 128 × 2 × 2 ≈ 2 MB per block.

When generating token 51:

The kernel checks: is the last block (block 3) full? It has 2/16 tokens. NOT full.

So just append the new K, V to block 3’s slot for token index 50 (which is the third slot of block 3).

The block table is unchanged.

If, at token 65, block 3 fills up: allocate a new block from the pool, append its index to the block table → [block_id_0, block_id_1, block_id_2, block_id_3, block_id_4_new].

When the request finishes (generates EOS or hits max tokens):

Mark all 4-5 blocks as FREE in the global block pool.

Discard the block table.

The freed blocks are immediately available for other requests.

This is exactly analogous to an OS freeing virtual memory pages when a process exits.

Attention computation across blocks:

For attention at the current position (token 51), the kernel needs K, V from ALL prior tokens (0-50). The block table tells it where each token’s K, V lives:

  • Tokens 0-15: read block_id_0 starting at offset 0.
  • Tokens 16-31: read block_id_1 starting at offset 0.
  • Etc.

The kernel issues multiple “block reads” instead of one contiguous read. There’s overhead in this indirection, but it’s small (~5-10% vs contiguous), and the memory-savings benefit (40× more concurrent requests) vastly outweighs the latency cost.

Why the block size matters

The choice of BLOCK_SIZE = 16 isn’t arbitrary:

Block size trade-off: SMALL blocks (e.g., 4 tokens): - Less internal fragmentation (closer to "exactly fits"). - More blocks per request, more indirection overhead. - Block table is larger. - More allocator overhead per request. LARGE blocks (e.g., 128 tokens): - More internal fragmentation (avg ~half a block wasted per request). - Fewer blocks per request, simpler indirection. - Better tensor-core alignment for attention kernel. vLLM default: BLOCK_SIZE = 16. - At 16 tokens × 320 KB/token = 5 MB per block (for 70B). - Avg ~2.5 MB wasted per request (half a block) — acceptable. - 16 is a "nice" power of 2 for kernel alignment. - Empirically the best trade-off across many workloads. Some systems use BLOCK_SIZE = 32 or 64 for larger contexts; 8 or 16 for many short requests.
— think, then check —

Why the gap is so dramatic:

Static allocation reserves MAX_CONTEXT for each request. If MAX_CONTEXT is set to handle the longest plausible request (say 32K tokens), then EVERY request reserves 32K × 320 KB = 10 GB.

For a typical chat request at 500 tokens: actual use is 500 × 320 KB = 160 MB. Utilisation: 160 MB / 10 GB = 1.6%.

The waste is the difference: 9.84 GB per request “reserved but unused.”

With paged allocation: chat request uses ceil(500/16) = 32 blocks × 5 MB = 160 MB allocated. Utilisation: 100% (well, 96% with last-block slack). Allocated = actually used.

What makes the workload distribution amplify this:

Chat workloads are HEAVILY skewed toward short requests:

  • 50% of requests: 100-1000 tokens.
  • 30%: 1000-5000 tokens.
  • 15%: 5000-20000 tokens.
  • 5%: 20K-100K tokens (RAG, long documents).

With static allocation, ALL of these consume MAX_CONTEXT memory. The 50% short requests waste 99% of their allocation.

With paged allocation, each request consumes its actual size. The 50% short requests use 1-3% of MAX_CONTEXT memory; the 5% long requests use most of theirs. Total: ~10-15% of static allocation.

So at the WORKLOAD level: paged saves 85-95% of memory.

Per-GPU throughput impact:

On an 80 GB H100 serving Llama 3 70B (model ~35 GB after quantization, leaving ~45 GB for KV):

  • Static: at MAX_CONTEXT=32K (10 GB/req), can serve 4 requests simultaneously.
  • Paged: at average 200 MB/req actual usage, can serve ~200 requests simultaneously.

50× more throughput per GPU. For commercial LLM serving where infrastructure cost dominates margins, this is THE optimisation.

Why nobody did this before vLLM:

The naive attention kernel reads K, V from contiguous memory. Implementing it over non-contiguous block-table-indexed memory requires kernel changes. Before vLLM, no one had built the production attention kernel that handles this efficiently. After vLLM open-sourced PagedAttention, every other engine adopted similar designs within months. It’s a classic “obvious in retrospect” optimisation.

The deeper analogy:

This is exactly the problem OSes solved with virtual memory in the 1960s. Pre-virtual-memory: programs preallocated their max memory; you couldn’t fit many programs. With paging: programs use only what they actually need; you fit many more. LLM serving was the “no virtual memory yet” era; PagedAttention brought it into the modern era.

Next: §23.2 — Continuous batching + prefill/decode disaggregation. The runtime patterns that turn PagedAttention into actual high-throughput serving.