RUNTIMES & FRAMEWORKS
Section 22.2
02

CUDA, Triton, MLX — what a ‘kernel’ is

Throughout Parts I-II we wrote SIMD kernels in C — explicit AVX2 intrinsics, manual cache blocking, tight loops. Those run on CPUs. Their GPU equivalents are CUDA kernels — also C/C++, also explicit, but written for thousands of GPU threads running the same code. The catch: writing CUDA is tedious and error-prone. Triton (OpenAI 2021) is a Python DSL that compiles to CUDA but lets you write kernels in something closer to numpy. MLX (Apple 2023) is a higher-level framework for Apple Silicon. This section shows a real CUDA dot-product, compares to its Triton equivalent, and walks the kernel-writing stack.

A CUDA dot product

The CUDA equivalent of Ch.1’s dot product kernel:

// CUDA dot product (simplified) __global__ void dot_kernel(const float* a, const float* b, float* result, int N) { __shared__ float partial_sums[BLOCK_SIZE]; int tid = threadIdx.x; int bid = blockIdx.x; int gid = bid * blockDim.x + tid; // Each thread computes one partial product float thread_sum = 0.0f; for (int i = gid; i < N; i += gridDim.x * blockDim.x) { thread_sum += a[i] * b[i]; } // Store in shared memory partial_sums[tid] = thread_sum; __syncthreads(); // Tree reduction within the block for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) { if (tid < stride) { partial_sums[tid] += partial_sums[tid + stride]; } __syncthreads(); } // Block sum gets atomically added to global result if (tid == 0) atomicAdd(result, partial_sums[0]); } // Launch: dot_kernel<<<num_blocks, BLOCK_SIZE>>>(a_dev, b_dev, result_dev, N);

What’s going on in this kernel:

  1. Many threads run in parallel (typically 128-1024 per block, many blocks per launch).
  2. Each thread computes a partial sum of the dot product over its assigned range.
  3. Shared memory (SRAM!) holds partial sums during a tree-reduction within the block.
  4. atomicAdd combines block sums into a single global result.

CUDA kernels give full control over the GPU but at the cost of boilerplate. The dot-product example is “simple” by CUDA standards; a FlashAttention kernel is 2000+ lines of CUDA, optimised for specific GPU generations.

Triton — Python DSL that compiles to CUDA

OpenAI 2021 (“Triton: An Intermediate Language and Compiler for Tiled Neural Network Computations”) introduced a Python DSL that targets CUDA:

# Triton dot product (much shorter) import triton import triton.language as tl @triton.jit def dot_kernel( a_ptr, b_ptr, result_ptr, N, BLOCK_SIZE: tl.constexpr, ): # Get this thread block's range pid = tl.program_id(0) offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE) mask = offsets < N # Load chunks (mask handles out-of-bounds) a = tl.load(a_ptr + offsets, mask=mask) b = tl.load(b_ptr + offsets, mask=mask) # Compute partial sum and reduce within block partial = tl.sum(a * b) # Atomically add to result tl.atomic_add(result_ptr, partial) # Launch: dot_kernel[(num_blocks,)](a, b, result, N, BLOCK_SIZE=1024)

Triton is dramatically more concise than CUDA — typically 5-10× fewer lines for equivalent kernels. The compiler handles:

What the programmer specifies:

Triton’s hidden cost: it’s less flexible than raw CUDA. Some optimisations (specific warp-level tricks, complex shared-memory layouts) need raw CUDA. But for ~80% of LLM kernel use cases, Triton is enough and dramatically faster to write.

— think, then check —

The three levels:

  1. Thread: the smallest unit of execution. Each thread runs the kernel function independently. A thread has its own registers and program counter. Typical work: ONE output element or a small tile of output.
  2. Block: a group of threads that run on the same Streaming Multiprocessor (SM). Threads in a block can share memory (via shared memory / SRAM) and synchronise (__syncthreads). Typical block size: 128-1024 threads.
  3. Grid: all the blocks needed to cover the work. Each block runs INDEPENDENTLY (cannot synchronise with other blocks) on whichever SM has capacity.

Mapping for a matmul C = A · B with A: M×K, B: K×N:

  • Block layout: use 2D blocks, where each block computes a TILE of C of size BLOCK_M × BLOCK_N (e.g., 128 × 128). Total blocks: (M / BLOCK_M) × (N / BLOCK_N).
  • Thread layout: within a block, 256 threads. Each thread computes a small (BLOCK_M/16) × (BLOCK_N/16) chunk of the block’s tile (e.g., 8 × 8 = 64 output values per thread).
  • Per-thread work: loop over k from 0 to K, loading A’s column and B’s row into registers / shared memory, accumulating into the thread’s output chunk.
  • Shared memory usage: within each block, threads cooperatively load A’s tile (BLOCK_M × BLOCK_K) and B’s tile (BLOCK_K × BLOCK_N) into shared memory. Each thread then reads from shared memory for its accumulations.
  • Tensor cores: for bf16/fp16 matmul, the inner loop uses mma.sync instructions (Hopper) that do 16×16 matmul per cycle.

For attention’s Q·K^T:

  • Same structure as matmul, with Q and K being the inputs.
  • The KEY innovation of FlashAttention: load Q tile and K tile into shared memory, compute scores, apply softmax (running stats), apply to V — ALL while staying in shared memory. Don’t materialise the full N×N score matrix in HBM.
  • This requires more shared memory management than a plain matmul — hence the FlashAttention kernel’s complexity (2000+ lines of CUDA).

Why this matters:

Good GPU performance comes from:

  • Right tile sizes so shared memory is well utilised.
  • Coalesced HBM accesses (adjacent threads load adjacent memory).
  • Hiding HBM latency by overlapping loads with computes.
  • Using tensor cores when possible.

Triton handles much of this automatically; raw CUDA requires careful hand-tuning.

MLX — Apple Silicon’s framework

Apple 2023 (MLX) is Apple’s answer to “how do you efficiently run ML on M-series chips”:

MLX characteristics: - Pure Python + C++ implementation, no CUDA. - Unified memory model (same as the M-series hardware). - Lazy evaluation (computations are deferred until needed). - Compiles to Apple Silicon's GPU + Neural Engine. - Open source (Apple's most accessible ML framework to date). Example: import mlx.core as mx x = mx.random.normal((1024, 1024)) y = mx.matmul(x, x.T) z = mx.softmax(y, axis=-1) # No computation has run yet — all lazy. mx.eval(z) # Now z is evaluated; results are in unified memory. Performance: For inference on M3 Max: matches or exceeds PyTorch + MPS backend. Tooled specifically for Apple's hardware (uses Metal Performance Shaders, ANE). Llama 3 7B at 30+ tokens/s on M3 Max in MLX (vs ~15 tokens/s in PyTorch + MPS).

MLX is Apple’s bet that the “future of consumer ML” is on-device. It’s well-designed for the M-series architecture (specifically the unified memory) and has good integration with Apple’s tooling. The community is small but growing.

The catch: MLX is Apple-only. There’s no path to CUDA, no community of kernel writers, no path to scale. It’s a specialised tool for a specific platform.

— think, then check —

The trade-off:

Triton is easier to write, easier to debug, and well-supported by the Inductor compiler. The compiler handles many optimisations automatically.

Raw CUDA gives full control but requires writing more code and managing more details. Specific low-level optimisations (warp specialisation, specific memory patterns, async copies) are easier in CUDA.

When Triton is enough:

For ~80% of new kernel work:

  • Element-wise operations (LayerNorm, GELU, RMS).
  • Simple matmuls.
  • Fused operations where the fusion is the main win.
  • Quick prototyping and iteration.
  • Kernels for less-popular precisions (fp4, fp8) where CUDA’s tooling is less mature.

FlashAttention 2’s Triton version achieved ~80% of the hand-tuned CUDA version’s performance with 1/10 the code. For most labs, this is the right trade-off.

When raw CUDA is needed:

For peak performance on specific hardware:

  • Cutting-edge optimisations (warp specialisation, async copies, specific PTX instructions).
  • Multi-stage pipelining that Triton’s compiler doesn’t optimise.
  • Hopper-specific features (TMA, distributed shared memory) that Triton doesn’t yet support.
  • Performance gaps that matter at billion-dollar training-run scale.

FlashAttention 3 (2024) targets H100 specifically and uses CUDA for the latest hardware features. The 20% performance gap over Triton matters when you’re training $100M models.

The pattern:

  • Research / prototyping: Triton (or torch.compile generating Triton).
  • Production / standard inference: Triton, hand-tuned where needed.
  • Frontier / pre-training: raw CUDA (FlashAttention 3, CUTLASS, custom kernels).

The hierarchy reflects the cost trade-off: Triton saves engineering time at the cost of last 10-20% performance. At small scale, save time. At frontier scale, that 20% is worth months of engineering.

The kernel stack in practice

Where each kernel comes from for a typical Llama 3 inference: Operation Eager fallback torch.compile Hand-tuned matmul (large) cuBLAS cuBLAS or Triton cuBLAS or CUTLASS matmul (small) cuBLAS Triton (fused) Triton RMSNorm aten kernel Triton (fused) hand-tuned CUDA / Triton attention naive multi-kernel FlashAttention via Inductor FlashAttention 3 CUDA softmax aten kernel Triton (fused) hand-tuned embedding lookup aten kernel Triton rarely customised The "hand-tuned" column is what frontier labs ship. The "torch.compile" column is what most users get with @torch.compile. The "eager fallback" is what naive PyTorch code uses.
— think, then check —

Why Apple shipped MLX:

PyTorch + MPS (Metal Performance Shaders) backend works on Mac but with friction:

  • Performance: significantly worse than the same model on equivalent CUDA. Apple Silicon’s hardware can do more than what MPS exposes.
  • Reliability: some operations fall back to CPU (slow); others crash or produce wrong results.
  • Tooling: limited support for Apple-specific features (unified memory, ANE).
  • Roadmap: PyTorch’s MPS development is community-driven; Apple has limited influence over priorities.

Apple’s response: build a framework FROM SCRATCH for their hardware.

MLX’s niche:

  • On-device inference: the primary target. Run frontier LLMs locally on Mac without depending on PyTorch.
  • Apple Silicon-specific optimisations: tight integration with unified memory; uses Metal Performance Shaders; can dispatch to the Neural Engine.
  • Research at Apple: Apple’s internal ML research uses MLX; the framework is shaped by their use cases.
  • Open source: first major Apple ML framework that’s openly developed; community can contribute.

Vs PyTorch + MPS:

  • MLX is FASTER on Mac for typical inference workloads (2-3× in many benchmarks).
  • MLX is SIMPLER for Mac-specific use cases (unified memory, on-device).
  • MLX is more LIMITED in ecosystem: fewer libraries, fewer pre-trained models, less community knowledge.
  • MLX is Mac-only — no path to NVIDIA, no path to scale.

The realistic choice:

If you’re building a Mac-only app (e.g., a desktop LLM tool, an iOS app): MLX is becoming the right choice.

If you’re doing cross-platform research: PyTorch (with MPS for Mac development, CUDA for actual experiments).

If you’re at a frontier lab: NVIDIA / CUDA, full stop.

MLX is positioned as the “consumer / edge” ML framework for Apple’s ecosystem. It’s not competing with PyTorch for the cloud / research market, but it’s clearly the right tool for the on-device space.

Next: §22.3 — ONNX, GGUF, safetensors. Model interchange formats and what a model file actually contains.