Continuous batching + prefill/decode disaggregation
PagedAttention solved the memory problem. The next bottleneck is compute. Static batching — collect N requests, run them as a batch, wait for all to finish — wastes massive compute because requests have different lengths. The slowest request gates the whole batch; faster requests finish and sit idle. Continuous batching (Yu 2022 “Orca”) solves this by letting requests join and leave the batch at each token-generation step. Combined with prefill/decode disaggregation — running the two different phases on different GPUs — this is the architecture that powers every commercial LLM API in 2025.
Why static batching wastes compute
Imagine three requests in a static batch:
The waste is structural. Different requests have different generation lengths; the slow ones force fast ones to wait. Static batching only works well when all requests are similar length, which is rarely true.
Continuous batching — requests enter and leave dynamically
Yu 2022 (“Orca: A Distributed Serving System for Transformer-Based Generative Models”) introduced the fix: at each token-generation step, REORGANISE the batch:
Continuous batching dramatically improves GPU utilisation. The compute that would have been wasted on padding is reallocated to new requests joining the batch.
The challenge: at each step, you’re computing one new token for many active requests, each with a DIFFERENT context length and DIFFERENT KV cache. The attention kernel has to handle variable-length contexts within a single batch. This is where PagedAttention’s block tables become essential — each request’s KV cache is a list of blocks; the kernel just iterates through each request’s blocks.
Prefill vs decode — two distinct phases
For each request, generation has two phases with COMPLETELY DIFFERENT compute profiles:
Prefill and decode are fundamentally different from a compute perspective. Prefill is compute-bound (you can use full GPU FLOPs); decode is memory-bound (you’re limited by HBM bandwidth pulling weights).
Data structures that change at each step:
1. Active request list: a dynamic list (vector / deque) tracking currently-batched requests. At each step:
- Remove requests that finished (EOS or max_tokens).
- Add new requests from the pending queue if memory + compute capacity allows.
2. Per-request state: each active request has:
- Current generated tokens (a list).
- Block table (pointers to its KV cache blocks).
- Logit-bias parameters, sampling temperature, etc.
- Generation step counter, max_tokens limit.
3. Pending queue: incoming requests waiting for a slot. Sorted by priority or FIFO.
Kernel changes required:
1. Variable-length attention: the attention kernel must handle a batch where each request has different context length AND different KV cache layout (block table). FlashAttention 2 has a “varlen” variant specifically for this. The kernel iterates per-request, reading from the request’s block-table-indexed memory.
2. Per-request sampling: after computing logits for all active requests in parallel, sample each request’s next token with its OWN parameters (different temperatures, top-p, etc.). This is a per-request branch in the sampling kernel.
3. Memory management: request finishes → block table freed → blocks returned to pool. Request starts → blocks allocated. Allocation has to be fast (often a free-list-based pool).
4. Batch packing: at each step, the active requests’ inputs and KV caches need to be efficiently laid out for the attention kernel. vLLM uses a “block table tensor” that the kernel reads to find each request’s blocks.
Scheduling logic:
1. When to add new requests: balance throughput (more requests = more parallelism) vs latency (more requests = larger batch = slower per-step).
2. How to handle preemption: if a long request blocks newer faster ones, can you “pause” it (save state, free blocks, resume later)? vLLM supports this for fairness.
3. Prefill vs decode interleaving: should new requests be prefilled IMMEDIATELY (interrupting decode), or batched with other prefills? Different systems make different choices (§23.3 covers disaggregation).
The implementation is non-trivial — vLLM’s scheduler alone is ~5K lines of code. But the benefit (2-5× throughput) is so large that every serious inference engine has built one.
Disaggregated serving — prefill on big GPUs, decode on small
The natural extension: since prefill and decode have different compute profiles, run them on different hardware.
Disaggregated serving is the latest evolution. Not all systems use it yet, but it’s the direction high-end commercial LLM serving is heading.
Why prefill is compute-bound:
Prefill processes N input tokens at once. The matmuls involved are:
- QKV projection: X (N × d) · W (d × d) = ON × d) matmul. With N=4K and d=4096: m=4K, k=4K → arithmetic intensity ~4096. Far above ridge.
- FFN: (N × d) · (d × F): m=N=4K, k=d=4K → high AI.
- Attention QK^T: (N × d) · (d × N): N²d = 67 GFLOPs at N=4K. Also high AI.
All operations are large enough to saturate tensor cores. H100 utilisation: 50-70% of peak FLOPs (~500 TFLOPs of bf16). The GPU is being used for what it’s designed for.
Why decode is memory-bound:
Decode processes ONE token at a time. The matmuls involved are:
- QKV projection: x (1 × d) · W (d × d) — this is a GEMV, not GEMM. m=1, k=d → AI ≈ 1 FLOP/byte. Far below ridge.
- FFN: (1 × d) · (d × F): m=1 → AI ≈ 1. Memory-bound.
- Attention against KV cache: q · K_cache (1 × N×d) — reads the KV cache from HBM. AI ≈ 1.
All operations are tiny. The actual FLOPs are minimal; the cost is loading weights and KV cache from HBM. H100 utilisation in decode: 5-15% of peak FLOPs. The compute units sit idle most of the time waiting for HBM.
Bandwidth math for decode:
Per generated token, need to load: ~70 GB of model weights (Llama 70B in fp16) + ~5-10 GB of KV cache (depending on context).
At 3.35 TB/s HBM: ~25 ms per token from bandwidth alone.
This explains why decode latency is ~30-50 ms per token on H100 for Llama 70B: it’s bottlenecked by HBM, not compute.
What this means for hardware:
- Prefill cluster: H100 / B200 / Trainium — high FLOP density. Tensor cores doing 1 PFLOPs of work.
- Decode cluster: could be cheaper hardware. AMD MI300X (more HBM, less FLOPs) might be better. Apple M2 Ultra (huge bandwidth, less FLOPs) is competitive. Cost-per-decoded-token is the metric.
The disaggregation thesis: spend FLOP money where it pays (prefill); spend bandwidth money where it pays (decode). Don’t pay for both on every GPU.
Continuous batching in practice — vLLM scheduler
vLLM’s serving loop is the canonical example:
(a) Chat (100-token prompt, 200-token response):
Prefill: 100 tokens × ~30 GFLOPs/token = ~3 TFLOPs total. Takes ~3 ms on H100. Tiny.
Decode: 200 tokens × ~30 ms/token = 6 seconds.
Time profile: 99.95% of work is in decode.
Disaggregation overhead: KV cache transfer is ~100 tokens × 320 KB = 32 MB. Transfer takes ~1 ms. Small.
BUT: the per-request prefill is so cheap that the prefill cluster gets crushed by request volume. Need many prefill GPUs to handle the request rate.
Verdict: PROBABLY NOT WORTH IT. Static unified serving with continuous batching is fine. Disaggregation adds complexity without strong benefit.
(b) RAG (16K-token prompt, 500-token response):
Prefill: 16K tokens × ~30 GFLOPs/token = ~500 TFLOPs total. Takes ~500 ms on H100 (compute-bound).
Decode: 500 tokens × ~50 ms/token = 25 seconds.
Time profile: ~2% prefill, ~98% decode. Prefill is small fraction but ABSOLUTE prefill time (500ms) is noticeable.
Disaggregation overhead: KV cache transfer is 16K × 320 KB = 5 GB. At 50 GB/s InfiniBand: ~100 ms. Significant but acceptable.
Win: prefill happens on FLOP-rich hardware (fast); decode happens on bandwidth-rich hardware (cheaper); KV transfer is amortised across a long generation.
Verdict: WORTH IT. Especially if you have many concurrent RAG requests (the prefill compute is the binding constraint).
(c) Code completion (4K-token prompt, 2K-token response):
Prefill: 4K tokens × ~30 GFLOPs/token = ~120 TFLOPs. Takes ~120 ms.
Decode: 2K tokens × ~30 ms/token = 60 seconds.
Time profile: 0.2% prefill, 99.8% decode.
Disaggregation overhead: 4K × 320 KB = 1.3 GB. Takes ~25 ms.
Verdict: MARGINAL. The transfer overhead is significant relative to prefill time. If the workload is consistent, disaggregation pays; if mixed with chat, the architecture complexity may not be worth it.
The deeper rule:
Disaggregation wins when:
- Prefill is a significant absolute cost (long prompts).
- The prefill compute is the binding constraint (you’d need to add more H100s otherwise).
- The decode infrastructure can use cheaper hardware.
- The KV transfer overhead is amortised across a long generation.
For most “chat-only” workloads: unified serving + continuous batching + PagedAttention is enough. For mixed RAG/long-context workloads: disaggregation is the next step.
The trend: as context lengths grow (1M tokens becoming common), disaggregation will become more universal because long-prefill workloads benefit so much.
Next: §23.3 — Speculative decoding deep dive. The third major inference optimisation: a draft model proposes K tokens, the target verifies them in parallel. 2-3× throughput at no quality cost.