THE HARDWARE SUBSTRATE
Section 21.3

TPUs, Apple Silicon, AMD MI300X — the non-NVIDIA landscape

If you read ML papers from 2017-2024, you might think NVIDIA is the entire compute industry. It isn’t. Google has shipped TPUs since 2016 with a fundamentally different architecture (systolic arrays). Apple Silicon has unified memory + a dedicated Neural Engine, changing cost economics for on-device LLMs. AMD MI300X matches H100 specs on paper with more HBM. AWS Trainium / Inferentia are AWS’s in-house alternatives. Cerebras builds wafer-scale chips. Groq targets ultra-fast inference. This section walks the major non-NVIDIA options, their architectural differences, and the honest question of why NVIDIA still wins the frontier despite all this competition.

TPUs — systolic arrays vs SIMT

Google’s TPU is the longest-running non-NVIDIA ML accelerator. The architectural difference is real:

GPU (SIMT: Single Instruction, Multiple Threads):
  • Many cores, each running independent threads.
  • For matmul: each thread computes one element of the output.
  • Threads coordinate via SRAM / shared memory.
  • Flexible: can do anything that fits the SIMT pattern.

TPU (systolic array):
  • A 2D grid of arithmetic units, each connected to its neighbours.
  • Data FLOWS through the grid (north-south, east-west).
  • For matmul: data flows in one direction (input), accumulator flows in another (output).
  • Each unit does one multiply-add per cycle.
  • Specialised: blazingly fast at matmul, less flexible for other ops.

TPU v5p (the recent generation), per Google's published specs:
  • ~459 TFLOPs in bf16 per chip
  • ~95 GB HBM per chip
  • ~2.8 TB/s HBM bandwidth per chip
  • Connected via a dedicated ICI (inter-chip interconnect) network at ~600 GB/s (4,800 Gbps) per chip

Notable: TPU v5p's per-chip HBM bandwidth is LOWER than H100's (~2.8 TB/s vs 3.35 TB/s). The TPU bet: most operations can be tiled so that every byte does many FLOPs (high arithmetic intensity).

TPUs are how Google trains Gemini and serves Search’s ML. The architecture is genuinely different — a systolic array doesn’t have the same “many independent threads” model as a GPU.

The TPU bet is structural: matmul is the dominant LLM operation, so build hardware that does matmul exceptionally well, accept that everything else (custom kernels, irregular ops) is slower. Empirically: TPUs match or beat GPUs on training throughput per dollar for transformer workloads.
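The "make every byte do many FLOPs" bet can be made concrete with the roofline ridge point: peak compute divided by memory bandwidth. The sketch below is illustrative only. `ridge_point` and `bound_by` are hypothetical helper names, and the H100 figures are the ones quoted in this chapter (989 TFLOPs bf16, 3.35 TB/s); the two example intensities are made-up round numbers.

```python
def ridge_point(peak_tflops: float, bandwidth_tbs: float) -> float:
    """FLOPs per byte at which a chip moves from memory-bound to compute-bound."""
    # TFLOPs / (TB/s) = FLOPs per byte; the 1e12 factors cancel.
    return peak_tflops / bandwidth_tbs


def bound_by(intensity: float, peak_tflops: float, bandwidth_tbs: float) -> str:
    """Classify an operation by its arithmetic intensity (FLOPs per byte)."""
    ridge = ridge_point(peak_tflops, bandwidth_tbs)
    return "compute" if intensity >= ridge else "memory"


# H100 numbers as quoted in this chapter (assumed dense bf16).
print(round(ridge_point(989, 3.35)))   # 295
print(bound_by(2, 989, 3.35))          # low-intensity op -> memory
print(bound_by(1000, 989, 3.35))       # well-tiled matmul -> compute
```

Any operation whose intensity sits below the ridge is limited by bandwidth rather than by FLOPs, which is why hardware that forces reuse is attractive for matmul-heavy workloads.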

The catch: TPUs don’t run unmodified CUDA-targeted PyTorch. You need JAX or TensorFlow (or PyTorch through the PyTorch/XLA bridge), all compiled via XLA. The tooling ecosystem is smaller. For a frontier lab investing $100M in a training run, the cost of porting is small; for everyone else, it’s a barrier.

— think, then check —

Systolic array structure:

A 2D grid of arithmetic units, each connected to its immediate neighbors. Imagine a 128×128 grid of FMA units, where:

  • Input A flows left-to-right across each row.
  • Input B flows top-to-bottom down each column.
  • At cycle t, the unit at position (i, j) sees A[i, s] and B[s, j] for s = t - i - j (the inputs are skewed so that matching elements arrive together); it multiplies them and adds the product to its local accumulator.
  • Output accumulators are pulled out at the end of the computation.

For a matmul C = A · B with A as (m, k) and B as (k, n): the array computes one C tile of size (m_grid, n_grid) in roughly m_grid + k + n_grid cycles (k cycles of streaming, plus the time for the skewed wavefront to fill and drain the grid). Once the grid is full, every unit performs one multiply-add per cycle.
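The dataflow above can be sketched as a small timing-level simulation. This is a minimal sketch, not how any real TPU is implemented: it assumes an output-stationary array with one unit per output element and a skewed input schedule in which slice s of the k dimension reaches unit (i, j) at cycle s + i + j. The function name `systolic_matmul` is hypothetical.

```python
def systolic_matmul(A, B):
    """Simulate an output-stationary systolic array computing C = A @ B.

    Returns (C, cycles). Unit (i, j) owns accumulator C[i][j] and performs
    at most one multiply-add per cycle.
    """
    m, k, n = len(A), len(B), len(B[0])
    acc = [[0] * n for _ in range(m)]
    # The last unit (m-1, n-1) consumes its last slice (k-1) at cycle
    # index m + n + k - 3, so m + n + k - 2 cycles are simulated in total.
    cycles = m + n + k - 2
    for t in range(cycles):
        for i in range(m):
            for j in range(n):
                s = t - i - j  # slice of the k dimension arriving at (i, j) now
                if 0 <= s < k:
                    acc[i][j] += A[i][s] * B[s][j]
    return acc, cycles


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C, cycles = systolic_matmul(A, B)
print(C)       # [[19, 22], [43, 50]]
print(cycles)  # 4
```

The cycle count m + n + k - 2 is the exact version of the "m_grid + k + n_grid" estimate: for large k the fill and drain overhead becomes negligible and the array is busy almost every cycle.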

Why arithmetic intensity is high:

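Each element of A is LOADED ONCE from memory and flows through n_grid units (one per column). Each element of B similarly flows through m_grid units. One tile therefore loads k · (m_grid + n_grid) elements and performs m_grid · n_grid · k multiply-adds: that is m_grid · n_grid / (m_grid + n_grid) multiply-adds per loaded element, or g/2 for a square g × g grid.

So per element loaded, about g FLOPs are computed (two per multiply-add), independent of k. For a 128×128 grid with 2-byte bf16 inputs: AI ≈ 64 FLOPs/byte at the edge of the array, set by the grid size rather than by k. On-chip buffers that keep tiles resident between passes raise the intensity seen by HBM further.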

Effectively, the systolic array USES EACH LOADED ELEMENT MANY TIMES before discarding it. This is the same “tiling” idea as GPU matmul, but baked into the hardware geometry.
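Counting loads and multiply-adds directly for one output tile is a useful cross-check on the reuse argument. The sketch below is a back-of-envelope count under stated assumptions: only input loads at the edge of the array are counted (output write-back and any on-chip buffering are ignored), and inputs are assumed to be 2-byte bf16. `tile_reuse` is a hypothetical helper name.

```python
def tile_reuse(m_grid: int, n_grid: int, k: int, bytes_per_elem: int = 2) -> dict:
    """Count input loads and FLOPs for one (m_grid x n_grid) output tile."""
    elems_loaded = k * (m_grid + n_grid)   # m_grid*k of A plus k*n_grid of B
    macs = m_grid * n_grid * k             # one multiply-add per (i, j, s)
    flops = 2 * macs                       # a multiply and an add each
    return {
        "uses_per_A_elem": n_grid,         # one use per column it passes
        "uses_per_B_elem": m_grid,         # one use per row it passes
        "flops_per_byte": flops / (elems_loaded * bytes_per_elem),
    }


print(tile_reuse(128, 128, 4096))
# {'uses_per_A_elem': 128, 'uses_per_B_elem': 128, 'flops_per_byte': 64.0}
```

Under these assumptions the per-tile intensity depends on the grid dimensions and not on k: k appears in both the FLOP count and the load count and cancels.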

Comparison to GPU:

GPU’s SIMT model: each thread loads its own elements, uses them, discards. Reuse comes from SRAM caching: thread 1 might load A[i, j]; thread 2 might find it in cache. But there’s no GUARANTEED reuse — depends on cache hit rates.

Systolic array: reuse is STRUCTURAL. Each element CAN’T be reloaded; it flows through the grid by design. The arithmetic intensity is high by architecture, not by software optimisation.

The trade-off:

Systolic arrays are SPECIALISED. They do matmul brilliantly but struggle at irregular operations (attention’s softmax, custom kernels, branches). GPUs are flexible but require careful kernel design to achieve high intensity.

For pure matmul workloads (large training runs), TPUs often beat GPUs in throughput/$.

For diverse workloads (research, fine-tuning, exploration), GPUs win on flexibility.

Apple Silicon — unified memory changes the game

The structural difference for Apple Silicon (M2 Pro/Max/Ultra, M3 family, M4 family):

Traditional GPU:
  • HBM is SEPARATE from CPU RAM.
  • Model data lives in HBM (80 GB on H100).
  • To run on more data: PCIe transfer from CPU → GPU (~64 GB/s). This is slow; it can dominate inference latency for short prompts.

Apple Silicon: unified memory.
  • M2 Ultra: up to 192 GB unified memory. M3 Max: up to 128 GB unified memory.
  • CPU and GPU access the SAME physical RAM at the SAME bandwidth.
  • Bandwidth: 800 GB/s (M2 Ultra), 400 GB/s (M3 Max).
  • No PCIe transfers; no copy operations between CPU and GPU.

For LLM inference, this means:
  • A Llama 3 70B at fp16 (140 GB) FITS on M2 Ultra (192 GB unified).
  • On a comparable PC: 140 GB doesn't fit on any single H100 (80 GB). You need 2 H100s + inter-GPU transfers + tensor parallelism.
  • M2 Ultra's 800 GB/s is "only" about 1/4 of one H100's bandwidth, so it is slower per token, but the single-device setup avoids multi-GPU coordination entirely.

Plus the Apple Neural Engine (ANE):
  • Specialised inference accelerator on-die.
  • Runs quantized models at very high throughput.
  • Used by Core ML for on-device inference (MLX targets the GPU instead).

Unified memory is Apple’s secret weapon for LLM inference. The headline: a single Mac Studio with M2 Ultra can run Llama 3 70B in fp16 natively, with no quantization and no multi-GPU setup. An H100-based PC cannot do that without going multi-GPU and paying the coordination overhead.
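The "does it fit" arithmetic is simple enough to script. The sketch below counts weights only; KV cache, activations, and runtime overhead are ignored, so a real deployment needs headroom beyond these numbers. `weights_gb` and `devices_needed` are hypothetical helper names.

```python
import math


def weights_gb(params_billion: float, bytes_per_param: float) -> float:
    """Weight footprint in GB (1e9 params x bytes per param / 1e9 bytes per GB)."""
    return params_billion * bytes_per_param


def devices_needed(model_gb: float, device_gb: float) -> int:
    """Minimum number of devices whose combined memory holds the weights."""
    return math.ceil(model_gb / device_gb)


llama70b_fp16 = weights_gb(70, 2)
print(llama70b_fp16)                       # 140
print(devices_needed(llama70b_fp16, 192))  # 1 (M2 Ultra, 192 GB unified)
print(devices_needed(llama70b_fp16, 80))   # 2 (H100, 80 GB each)
```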

For training, Apple Silicon is much less compelling: lower peak FLOPs (~50 TFLOPs vs H100’s 1000), no NVLink-class interconnect for multi-machine, weaker tooling. But for INFERENCE — particularly very large models — Apple Silicon is genuinely competitive.

— think, then check —

(a) 2× H100 PC ($80K, ~1500W):

  • Memory: 2× 80 GB = 160 GB HBM. Llama 3 70B fp16 (140 GB) fits with room for KV cache.
  • Bandwidth: 2× 3.35 TB/s aggregate. Extremely fast.
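  • Throughput at batch=1: bandwidth-bound ceiling of about 48 tokens/s (6.7 TB/s ÷ 140 GB of weights read per token); real throughput lands below that.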
  • Inter-GPU comm: NVLink at 900 GB/s.
  • Software: PyTorch + CUDA fully supported.
  • Catch: $80K is hard to justify for individual use, and used prices are also inflated. Draws ~1500W continuously.

(b) Mac Studio M2 Ultra ($7K, ~370W):

  • Memory: 192 GB unified. Llama 3 70B fp16 fits.
  • Bandwidth: 800 GB/s — about 1/4 of H100 but applied to a model that fits on ONE device.
  • Throughput at batch=1: bandwidth-bound ceiling of about 6 tokens/s for 70B at fp16 (800 GB/s ÷ 140 GB), or about 19 tokens/s at int4 (~42 GB).
  • Software: MLX (Apple’s), llama.cpp, Ollama all work; PyTorch via MPS backend (rough but improving).
  • Plus: quiet, low power, desk-friendly. The “macOS workstation” feel.
  • Catch: lower peak performance than H100. Can’t train large models efficiently.

(c) 4× RTX 4090 PC ($15K, ~1800W):

  • Memory: 4× 24 GB = 96 GB total. Llama 3 70B fp16 (140 GB) DOESN’T FIT. Must use quantization (int4) or smaller model.
  • With int4 quantization (Q4_K_M, ~42 GB): fits, with KV cache room.
  • Bandwidth: 4× 1 TB/s — but per-GPU. PCIe-connected, not NVLink. PCIe = 64 GB/s bottleneck per pair.
  • Throughput at batch=1: ~30-50 tokens/s for 70B at int4 (limited by inter-GPU comm).
  • Software: PyTorch + CUDA fully supported. But multi-GPU coordination is tricky on PCIe.
  • Plus: more flexible than Mac; can also run other GPU workloads (gaming, rendering, fine-tuning small models).
  • Catch: needs powerful PSU, cooling, requires technical setup. Loud.

The picks:

  • For pure inference, single-user, “I want to run frontier models locally”: Mac Studio M2 Ultra. The simplest setup with the best fit-the-model story.
  • For inference + occasional small-model fine-tuning + gaming: 4× RTX 4090 PC.
  • For serious development (full fine-tuning, multi-GPU optimisation): 2× H100 PC (but only if budget allows; otherwise rent cloud H100s).

The deep point: Apple’s unified memory created a new product category — “local frontier LLM inference at $7K” — that didn’t exist before. For users whose primary need is inference, this is increasingly competitive with traditional GPU setups.
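A rough way to sanity-check batch=1 throughput figures for setups like these is the bandwidth ceiling: in plain autoregressive decoding every generated token has to read all the weights once, so tokens/s cannot exceed memory bandwidth divided by model size. The sketch below is an upper bound only. It ignores KV-cache reads, inter-GPU communication, and techniques such as speculative decoding, and it assumes ideal tensor parallelism when summing the two H100s' bandwidth. `decode_ceiling_tok_s` is a hypothetical helper name.

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Upper bound on batch=1 tokens/s when each token reads all weights once."""
    return bandwidth_gb_s / weights_gb


# (aggregate bandwidth in GB/s, weight footprint in GB), from the figures above.
setups = {
    "2x H100, 70B fp16": (2 * 3350, 140),
    "M2 Ultra, 70B fp16": (800, 140),
    "M2 Ultra, 70B int4 (~42 GB)": (800, 42),
}

for name, (bandwidth, size) in setups.items():
    print(f"{name}: <= {decode_ceiling_tok_s(bandwidth, size):.1f} tok/s")
```

Measured throughput above the ceiling usually means the model was quantized more aggressively than stated, or that the number refers to batched or speculative decoding rather than plain batch=1 generation.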

AMD MI300X — the credible alternative

AMD’s MI300X is the closest direct competitor to H100:

MI300X vs H100 (vendor datasheet peaks, dense):

                          MI300X         H100
  HBM capacity            192 GB         80 GB          (MI300X 2.4×)
  HBM bandwidth           5.3 TB/s       3.35 TB/s      (MI300X 1.6×)
  bf16 FLOPs              ~1.3 PFLOPs    989 TFLOPs     (MI300X ~1.3× on paper)
  fp8 FLOPs               ~2.6 PFLOPs    1.98 PFLOPs    (MI300X ~1.3× on paper)
  Inter-GPU bandwidth     ~900 GB/s      900 GB/s       (parity)

  Memory-bound apps:      FAVOURS MI300X
  Compute-bound apps:     MI300X on paper; often H100 in practice, because its kernels reach a higher fraction of peak

AMD's bet: more memory per GPU. A 70B model in fp16 (140 GB) FITS ON ONE MI300X. On H100, you need 2 GPUs + multi-GPU coordination.

Catch: software. ROCm (AMD's CUDA equivalent) is improving but lags. PyTorch support is partial; libraries like FlashAttention 3 had to be ported. The software gap is real but narrowing.

Recent commercial uptake:
  • Meta has deployed MI300X at scale for Llama inference.
  • Microsoft Azure offers MI300X-based VMs.
  • OpenAI uses some MI300X capacity.

Production adoption is real but still secondary to NVIDIA.

The MI300X case is structural — the 192 GB HBM is genuinely advantageous for many workloads. For pure inference of a 70B model, one MI300X is simpler and cheaper than two H100s. The catch is software maturity; ROCm has lagged CUDA by ~2 years for the past decade. This gap is closing as AMD invests, but it’s real today.

Why NVIDIA still wins

After all this, why is NVIDIA still dominant?

The NVIDIA moat (2025 state):

  1. CUDA ecosystem.
     • Every ML library (PyTorch, JAX, TensorFlow) primarily targets CUDA.
     • 99% of published ML research uses NVIDIA hardware.
     • "It just works": minimal porting overhead.
  2. Performance per-dollar at frontier scale.
     • For training, NVIDIA's tooling extracts more % of peak FLOPs.
     • Even when MI300X or TPU has more raw FLOPs, NVIDIA often delivers more usable FLOPs because of mature kernels.
  3. NVLink + Mellanox networking.
     • NVIDIA owns the GPU + InfiniBand stack via the Mellanox acquisition.
     • An end-to-end NVIDIA cluster has tighter integration than mixed stacks.
  4. Risk aversion.
     • When training a $100M model, you don't want hardware surprises.
     • NVIDIA + CUDA is the "boring choice" with predictable behaviour.
  5. Talent + community.
     • Most ML engineers learned on NVIDIA. Knowledge transfers.
     • Most performance optimisation discourse references NVIDIA.

What could change this:
  • A 2-3× advantage on inference price/perf (e.g., MI300X at half the price of H100 for the same throughput).
  • Major software effort from AMD / Intel / Google to close the CUDA gap.
  • Geographic / supply chain forcing functions.

What likely won't change it (2-3 year horizon):
  • The frontier remains NVIDIA. The mid-tier is where alternatives gain share.

— think, then check —

The actual obstacles:

1. CUDA dominance.

Every ML library, every framework, every tool primarily targets CUDA. Porting work is substantial: getting equivalent PyTorch performance on MI300X took AMD ~3 years of full-time effort. JAX on TPU is well-supported but doesn’t have the ecosystem of PyTorch.

Cost to switch: estimated $1-10M+ of engineering investment per major lab to port stack to non-NVIDIA. Not impossible but high.

2. Performance at frontier scale.

NVIDIA’s mature kernels (cuBLAS, FlashAttention, cuDNN) extract 60-70% of peak. AMD’s ROCm extracts 30-50% on the same workloads. TPU’s XLA extracts 50-70% but only on specific workloads.

For training a $100M model, even a 20% efficiency gap means $20M wasted. NVIDIA wins on “usable FLOPs” not just “claimed FLOPs.”

3. Networking integration.

NVIDIA + Mellanox (now owned by NVIDIA) is the “end-to-end” stack. Mixing NVIDIA GPUs with non-NVIDIA networking is risky; pure-NVIDIA cluster is predictable.

4. Talent / knowledge.

Most ML engineers know CUDA. Performance optimisation, debugging, kernel writing — all the lore is NVIDIA-centric. Hiring people who can extract performance from MI300X is much harder.

5. Risk aversion.

Frontier labs aren’t optimising for “lowest cost”; they’re optimising for “lowest risk of training failure.” NVIDIA’s predictability is worth a 20-30% price premium.
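The "usable FLOPs" arithmetic from item 2 can be written down in a few lines. This is a sketch under simplifying assumptions: both stacks are taken to have the same peak FLOPs and the same price per hour, so the only difference is the fraction of peak actually achieved. The 65% and 52% utilisation figures are illustrative values chosen to give a 20% relative gap, not measurements, and `wasted_spend` is a hypothetical helper name.

```python
def wasted_spend(budget_usd: float, util_best: float, util_actual: float) -> float:
    """Part of a fixed budget that buys no useful FLOPs relative to the better stack."""
    if not 0 < util_actual <= util_best <= 1:
        raise ValueError("expected 0 < util_actual <= util_best <= 1")
    return budget_usd * (1 - util_actual / util_best)


# $100M run, 65% of peak on the mature stack vs 52% on the less mature one.
print(f"${wasted_spend(100e6, 0.65, 0.52) / 1e6:.0f}M")  # $20M
```

A cheaper chip can still come out ahead if its price advantage exceeds the utilisation gap, which is why the comparison that matters is cost per usable FLOP rather than cost per peak FLOP.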

What would cause a shift:

  1. Genuine performance gap at frontier. If AMD or someone delivered 3× better throughput per dollar AND tooling caught up to within 10% of CUDA, labs would switch. AMD is not there yet (closer to 1.2-1.5× at best with software maturity catching up).
  2. Geographic / supply chain forcing. If NVIDIA’s supply chain were disrupted (export controls, manufacturing issues), labs would have to use alternatives.
  3. Open-source CUDA equivalent. A community-maintained CUDA-compatible runtime that works on multiple hardware (Triton is partly this, but doesn’t yet cover everything).
  4. Hardware-specific architecture optimisation. A model architecture that maps PERFECTLY to TPU’s systolic array (or to Cerebras’s wafer-scale) might give 2-3× speedup over NVIDIA for that specific architecture. So far no such “TPU-native” or “Cerebras-native” model has emerged.

What’s actually happening in 2025:

  • Frontier: still NVIDIA (95%+).
  • Mid-tier training: NVIDIA majority, AMD and TPU minority (5-15%).
  • Inference for very large models: AMD MI300X gaining share due to 192 GB HBM.
  • On-device inference: Apple Silicon dominant, with Qualcomm chasing.
  • Hyperscaler in-house: Google (TPU), AWS (Trainium/Inferentia), Microsoft (announced Maia) — building alternatives for capex reasons but not displacing NVIDIA in commercial offerings.

The honest forecast:

NVIDIA dominance erodes 5-10% over the next 2-3 years. The decline is from the bottom (commodity inference, on-device) rather than the top. The frontier stays NVIDIA until either AMD ships a much-better-than-H100 chip or NVIDIA stumbles. Neither looks imminent.

END OF CH.21 — The hardware substrate.
§1 (memory hierarchy + roofline: H100 ridge at ~300 FLOPs/byte, almost everything below) · §2 (tensor cores, fp8, NVLink: how H100 reaches 1 PFLOPs and what’s needed to use it) · §3 (TPUs, Apple Silicon, AMD MI300X: the alternatives and the genuine cases where each wins).

Next: Ch.22 — Runtimes & frameworks. PyTorch’s dispatch stack, what a CUDA kernel actually is, the GGUF/ONNX/safetensors formats.