Tensor cores, fp8, NVLink
When NVIDIA quotes “1 PFLOPs” for an H100, the number isn’t coming from the general-purpose CUDA cores you might know from earlier GPUs — it’s coming from tensor cores, specialised hardware blocks that do small matmul operations (16×16 or 16×8 tiles) in a single instruction. Tensor cores are the reason modern GPU FLOPs are 10× higher than general FLOPs at the same transistor count. Stack on top of this fp8 and fp4 — lower-precision data types that further multiply effective FLOPs — plus NVLink for fast GPU-to-GPU communication, and you have the trio that defines the modern LLM training and inference stack. This section walks each piece and the H100 → B200 generation jump.
Tensor cores — what they actually do
Tensor cores are the reason you can buy a GPU with “1 PFLOPs of compute.” The general CUDA cores on the same chip provide ~50 TFLOPs (20× less) — so most of the chip’s flop budget is in the tensor cores.
For tensor cores to be usable, the matmul must:
- Be large enough — minimum dimensions ~16 along each axis. Tiny matmuls bypass tensor cores entirely.
- Have the right dtype — bf16, fp16, int8, fp8, etc. fp32 inputs aren’t tensor-core eligible (you’d downcast to bf16 first).
- Be properly tiled — the kernel must emit the specific
mma.syncinstructions (or wmma equivalents). cuBLAS, CUTLASS, Triton, and torch.compile all do this; hand-written naive matmul kernels often don’t.
fp8 and fp4 — the new precisions
Lower precision means:
- Each element is fewer bytes (saves bandwidth — recall §21.1).
- Tensor cores can do MORE operations per cycle (since they’re packed for the dtype).
The fp8 story is interesting. Two formats exist:
- fp8 E4M3: 4 exponent bits, 3 mantissa bits. Better for weights (less range needed).
- fp8 E5M2: 5 exponent bits, 2 mantissa bits. Better for activations / gradients (more range needed).
Production training in fp8 typically uses E4M3 for the forward pass and E5M2 for the backward pass, with per-tensor scale factors to bring values into the format’s representable range. Combined: ~2× faster training than bf16 with comparable convergence.
fp8 E4M3 / E5M2 are the formats that enable practical fp8 training. NVIDIA’s “Transformer Engine” library handles the per-layer scale management automatically.
The gap:
H100 has TWO compute paths: general CUDA cores (~50 TFLOPs total) and tensor cores (~1000 TFLOPs total). A kernel that only uses general CUDA cores is limited to the 50 TFLOPs — about 5% of the chip.
cuBLAS dispatches large matmuls to tensor cores, getting 750+ TFLOPs — ~75% of peak.
Tensor cores’ specific role:
One H100 tensor core instruction performs:
D = A · B + C, where A is 16×16, B is 16×8, C is 16×8 (bf16 inputs, fp32 accumulate).
That’s 16·16·8 = 2048 multiply-adds = 4096 FLOPs per cycle per tensor core.
To use tensor cores: emit mma.sync or wmma::mma_sync PTX instructions (or use a high-level API like CUTLASS or Triton that emits them).
What makes a kernel tensor-core-friendly:
- Right dtype: bf16, fp16, int8, fp8, fp4 (depending on generation). fp32 is NOT tensor-core eligible on H100 (would need bf16 downcast first).
- Sufficient size: matrix dims ≥ 16 along each axis. Smaller dims fall back to general CUDA cores.
- Aligned addresses: tensor cores work on 16-byte aligned data. Misaligned loads waste cycles.
- Tiled to SRAM: the A, B, C tiles must fit in SRAM/registers. A naive load-from-HBM-each-step kernel can’t feed tensor cores fast enough.
- Proper accumulation: tensor cores accumulate in fp32 (or higher precision than inputs). Kernels that don’t preserve this accuracy lose quality.
- Pipelining: overlap load + compute. Tensor core’s 4096 FLOPs/cycle is wasted if data arrives slowly.
cuBLAS implements all of these. CUTLASS is a NVIDIA open-source library that exposes the building blocks for custom kernels. Triton (next chapter) is a higher-level DSL that emits tensor-core kernels automatically.
Hand-written naive kernels usually miss several of these — explaining the 15× gap.
NVLink and the multi-GPU communication problem
A single H100 has 80 GB of HBM. A 70B model in fp16 needs 140 GB. You can’t fit Llama 2 70B on one H100; you need multi-GPU.
For training, you need 4-16K H100s coordinating. For inference, often 2-8 GPUs per request. Either way, you need fast GPU-to-GPU communication.
The NVLink number (900 GB/s) is huge compared to PCIe (64 GB/s) — about 14× faster. This is why “multi-GPU training in a single node” is dramatically faster than “multi-GPU across PCIe-connected nodes”: NVLink doesn’t hit the PCIe bottleneck.
For multi-node training (when you need more than 8 GPUs), NVLink ends at the node boundary. Inter-node traffic goes over InfiniBand at ~50 GB/s — still fast, but ~18× slower than NVLink. This is why training-job topology matters: certain operations (all-reduce within a TP group) want to stay intra-node; others (DP all-reduces) can spread across nodes.
Why everyone doesn’t use it (yet):
fp8 has 7-bit precision (E4M3) or 5-bit precision (E5M2) plus exponent. Compared to bf16’s 7-bit mantissa and 8-bit exponent, fp8 has:
- Much narrower range (E4M3 maxes at ±240; bf16 maxes at ±3.4e38).
- Lower precision per representable value.
This narrow range means values can OVERFLOW (gradients spike) or UNDERFLOW (small gradients become zero) easily. Without intervention, fp8 training diverges within a few hundred steps.
What can fail:
- Activation overflow: some layers produce activations > 240. In fp8, they saturate to ±240, losing information.
- Gradient underflow: small gradients (~1e-5) become 0 in fp8. Effective learning rate drops to zero for those parameters.
- Loss divergence: accumulated errors from fp8 compound; the model fails to converge.
- Optimizer state corruption: Adam’s m, v stats need higher precision than fp8 provides.
NVIDIA Transformer Engine’s solution:
- Per-tensor scaling: before each matmul, compute a SCALE FACTOR for the tensor: scale = max(|x|) / fp8_max. Multiply the tensor by 1/scale before the matmul; multiply the output by scale after. This brings the values into the representable range.
- Mixed format: use E4M3 for forward pass (weights, activations) where precision matters more than range. Use E5M2 for backward pass (gradients) where range matters more than precision. Switch per direction.
- Master copy in higher precision: keep weights as bf16 or fp32 master. Cast to fp8 only at matmul time. Optimizer state (Adam m, v) stays fp32.
- Recipe tuning: empirically derived combinations of warmup, learning rate, and gradient clipping that make fp8 stable.
- Selective fp8: some layers (LayerNorm, final layer) stay in bf16. Most matmuls use fp8. Per-layer dtype selection.
Empirical state:
NVIDIA Transformer Engine + careful tuning achieves fp8 training that matches bf16 perplexity within ~0.01 on Llama-class models. Speedup: 1.5-2× vs bf16.
Production use: GPT-5 reportedly trained partially in fp8. Anthropic’s Claude pretraining uses fp8 selectively. Meta’s research papers describe fp8 training.
The bar is being lowered each year as tooling improves, but as of 2025, fp8 training requires the Transformer Engine library; pure bf16 is still the “safe default” for production training.
The H100 → B200 jump
The Blackwell architecture’s key innovations beyond raw scale:
- fp4 support. Inference at half the bandwidth (and 2× the FLOPs) of fp8. Combined with appropriate quantization (Ch.24), enables 1T+ parameter inference on a single Blackwell node.
- Larger NVLink domain. B200 NVSwitch supports 72 GPUs in a single coherent domain (vs 8 for H100), making “single-rack inference” of 1T+ models feasible.
- Specialised transformer instructions. New PTX-level support for “transformer engine” operations (fused attention, fused norm, etc.).
Setup:
16K H100s = 2000 nodes × 8 GPUs/node. Each node has 8 GPUs connected by full NVLink mesh. Nodes are interconnected by InfiniBand (or similar high-speed inter-node fabric).
The 70B model is partitioned via tensor parallelism (TP) and data parallelism (DP) — and maybe pipeline parallelism (PP) too.
Typical 70B-class split:
- Tensor parallel: TP=8 (split each layer’s matmul across the 8 GPUs of one node). NVLink-intensive.
- Pipeline parallel: PP=4 (split the 80 layers into 4 sequential pipeline stages). Inter-node communication.
- Data parallel: DP = 16K / (8 × 4) = 500 ranks. Inter-node.
Data flow per gradient step:
1. Forward pass:
Within each TP group (8 GPUs in a node): every linear layer needs an all-reduce of partial outputs. ~140 GB / 8 = 17.5 GB per all-reduce, ~32 layers × 2 all-reduces/layer = 64 all-reduces. Total: ~1.1 TB of NVLink traffic per forward step. At 900 GB/s, takes ~1 second.
Between PP stages: each rank sends its activations to the next PP stage. Crosses node boundaries → InfiniBand. ~few GB per microbatch.
2. Backward pass:
Symmetric to forward. ~2× the data movement.
3. All-reduce for DP:
At the end of the step, gradients across the 500 DP ranks are all-reduced. ~140 GB of gradients per rank. The all-reduce is a tree/ring operation over 500 ranks across the InfiniBand fabric. Heavy: takes seconds.
Why the assignment matters:
If TP were split ACROSS nodes (instead of within), every layer’s all-reduce would use InfiniBand (50 GB/s) instead of NVLink (900 GB/s) — 18× slower. A single step would take 30+ seconds instead of 6.
If DP were ALSO confined within nodes, the DP all-reduce would still be intra-node — fast — but you’d be limited to 8-way DP. With 16K GPUs you’d run out of TP/PP/EP dimensions; you’d waste GPUs.
The optimal split puts the HIGHEST-BANDWIDTH operations (TP all-reduces, per-layer comms) on the FASTEST link (NVLink intra-node). The LOWER-BANDWIDTH ops (DP all-reduce, once per step) go on InfiniBand.
This is why pre-training engineers obsess over “the parallelism strategy” — it’s not just about dividing the model; it’s about matching the comms hierarchy to the bandwidth hierarchy.
Next: §21.3 — TPUs, Apple Silicon, AMD MI300X. The non-NVIDIA landscape: why TPUs are different, what Apple Silicon’s unified memory gets you, and where AMD stands.