CUDA, Triton, MLX — what a 'kernel' is

Section 22.2

CUDA, Triton, MLX — what a ‘kernel’ is

Throughout Parts I-II we wrote SIMD kernels in C — explicit AVX2 intrinsics, manual cache blocking, tight loops. Those run on CPUs. Their GPU equivalents are CUDA kernels — also C/C++, also explicit, but written for thousands of GPU threads running the same code. The catch: writing CUDA is tedious and error-prone. Triton (OpenAI 2021) is a Python DSL that compiles to CUDA but lets you write kernels in something closer to numpy. MLX (Apple 2023) is a higher-level framework for Apple Silicon. This section shows a real CUDA dot-product, compares to its Triton equivalent, and walks the kernel-writing stack.

A CUDA dot product

The CUDA equivalent of Ch.1’s dot product kernel:

// CUDA dot product (simplified) __global__ void dot_kernel(const float* a, const float* b, float* result, int N) { __shared__ float partial_sums[BLOCK_SIZE]; int tid = threadIdx.x; int bid = blockIdx.x; int gid = bid * blockDim.x + tid; // Each thread computes one partial product float thread_sum = 0.0f; for (int i = gid; i < N; i += gridDim.x * blockDim.x) { thread_sum += a[i] * b[i]; } // Store in shared memory partial_sums[tid] = thread_sum; __syncthreads(); // Tree reduction within the block for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) { if (tid < stride) { partial_sums[tid] += partial_sums[tid + stride]; } __syncthreads(); } // Block sum gets atomically added to global result if (tid == 0) atomicAdd(result, partial_sums[0]); } // Launch: dot_kernel<<<num_blocks, BLOCK_SIZE>>>(a_dev, b_dev, result_dev, N);

What’s going on in this kernel:

Many threads run in parallel (typically 128-1024 per block, many blocks per launch).
Each thread computes a partial sum of the dot product over its assigned range.
Shared memory (SRAM!) holds partial sums during a tree-reduction within the block.
atomicAdd combines block sums into a single global result.

CUDA kernels GPU programming A function that runs on NVIDIA GPUs, written in C++ with CUDA extensions. Marked with __global__, launched with the <<<...>>> syntax. Each invocation runs thousands of threads in parallel, organised into thread blocks. The kernel author manages: thread indexing, shared memory allocation, warp-level synchronisation, memory access patterns. Powerful but tedious; a non-trivial kernel can be 1000+ lines of C++. Most LLM-relevant CUDA kernels (FlashAttention, FlashInfer, etc.) are open-source and developed by specialist teams. give full control over the GPU but at the cost of boilerplate. The dot-product example is “simple” by CUDA standards; a FlashAttention kernel is 2000+ lines of CUDA, optimised for specific GPU generations.

Triton — Python DSL that compiles to CUDA

OpenAI 2021 (“Triton: An Intermediate Language and Compiler for Tiled Neural Network Computations”) introduced a Python DSL that targets CUDA:

# Triton dot product (much shorter) import triton import triton.language as tl @triton.jit def dot_kernel( a_ptr, b_ptr, result_ptr, N, BLOCK_SIZE: tl.constexpr, ): # Get this thread block's range pid = tl.program_id(0) offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE) mask = offsets < N # Load chunks (mask handles out-of-bounds) a = tl.load(a_ptr + offsets, mask=mask) b = tl.load(b_ptr + offsets, mask=mask) # Compute partial sum and reduce within block partial = tl.sum(a * b) # Atomically add to result tl.atomic_add(result_ptr, partial) # Launch: dot_kernel[(num_blocks,)](a, b, result, N, BLOCK_SIZE=1024)

Triton GPU DSL A Python-embedded DSL for writing GPU kernels (OpenAI, 2021). Triton code looks like NumPy with explicit block-level parallelism; the Triton compiler emits CUDA. Eliminates much of CUDA's boilerplate (thread indexing, shared memory management, warp synchronisation) while still allowing kernel-level optimisation. Now the preferred way to write custom GPU kernels for many ML applications. Used by FlashAttention 2/3, OpenAI's training stack, PyTorch's torch.compile Inductor backend. is dramatically more concise than CUDA — typically 5-10× fewer lines for equivalent kernels. The compiler handles:

Thread indexing within blocks.
Shared memory allocation and synchronisation.
Warp-level reduction patterns.
Pointer arithmetic and masking.

What the programmer specifies:

The block size and tiling strategy.
The high-level computation (load, compute, store).
Memory access patterns (Triton’s compiler optimises these).

Triton’s hidden cost: it’s less flexible than raw CUDA. Some optimisations (specific warp-level tricks, complex shared-memory layouts) need raw CUDA. But for ~80% of LLM kernel use cases, Triton is enough and dramatically faster to write.

— think, then check —

The three levels:

Thread: the smallest unit of execution. Each thread runs the kernel function independently. A thread has its own registers and program counter. Typical work: ONE output element or a small tile of output.
Block: a group of threads that run on the same Streaming Multiprocessor (SM). Threads in a block can share memory (via shared memory / SRAM) and synchronise (__syncthreads). Typical block size: 128-1024 threads.
Grid: all the blocks needed to cover the work. Each block runs INDEPENDENTLY (cannot synchronise with other blocks) on whichever SM has capacity.

Mapping for a matmul C = A · B with A: M×K, B: K×N:

Block layout: use 2D blocks, where each block computes a TILE of C of size BLOCK_M × BLOCK_N (e.g., 128 × 128). Total blocks: (M / BLOCK_M) × (N / BLOCK_N).
Thread layout: within a block, 256 threads. Each thread computes a small (BLOCK_M/16) × (BLOCK_N/16) chunk of the block’s tile (e.g., 8 × 8 = 64 output values per thread).
Per-thread work: loop over k from 0 to K, loading A’s column and B’s row into registers / shared memory, accumulating into the thread’s output chunk.
Shared memory usage: within each block, threads cooperatively load A’s tile (BLOCK_M × BLOCK_K) and B’s tile (BLOCK_K × BLOCK_N) into shared memory. Each thread then reads from shared memory for its accumulations.
Tensor cores: for bf16/fp16 matmul, the inner loop uses mma.sync instructions (Hopper) that do 16×16 matmul per cycle.

For attention’s Q·K^T:

Same structure as matmul, with Q and K being the inputs.
The KEY innovation of FlashAttention: load Q tile and K tile into shared memory, compute scores, apply softmax (running stats), apply to V — ALL while staying in shared memory. Don’t materialise the full N×N score matrix in HBM.
This requires more shared memory management than a plain matmul — hence the FlashAttention kernel’s complexity (2000+ lines of CUDA).

Why this matters:

Good GPU performance comes from:

Right tile sizes so shared memory is well utilised.
Coalesced HBM accesses (adjacent threads load adjacent memory).
Hiding HBM latency by overlapping loads with computes.
Using tensor cores when possible.

Triton handles much of this automatically; raw CUDA requires careful hand-tuning.

↳ §22.2 + CUDA programming guide

MLX — Apple Silicon’s framework

Apple 2023 (MLX) is Apple’s answer to “how do you efficiently run ML on M-series chips”:

MLX characteristics: - Pure Python + C++ implementation, no CUDA. - Unified memory model (same as the M-series hardware). - Lazy evaluation (computations are deferred until needed). - Compiles to Apple Silicon's GPU + Neural Engine. - Open source (Apple's most accessible ML framework to date). Example: import mlx.core as mx x = mx.random.normal((1024, 1024)) y = mx.matmul(x, x.T) z = mx.softmax(y, axis=-1) # No computation has run yet — all lazy. mx.eval(z) # Now z is evaluated; results are in unified memory. Performance: For inference on M3 Max: matches or exceeds PyTorch + MPS backend. Tooled specifically for Apple's hardware (uses Metal Performance Shaders, ANE). Llama 3 7B at 30+ tokens/s on M3 Max in MLX (vs ~15 tokens/s in PyTorch + MPS).

MLX Apple ML framework Apple's ML framework for Apple Silicon (announced December 2023, open-sourced). Designed for the unified memory architecture of M-series chips. Features: lazy evaluation, unified Python + Swift APIs, Metal Performance Shaders backend, Neural Engine integration. Used by llama.cpp's Metal backend, Ollama, and increasingly by Hugging Face users for on-device inference. Closest 'native PyTorch alternative' for Apple Silicon; PyTorch + MPS works but with friction. is Apple’s bet that the “future of consumer ML” is on-device. It’s well-designed for the M-series architecture (specifically the unified memory) and has good integration with Apple’s tooling. The community is small but growing.

The catch: MLX is Apple-only. There’s no path to CUDA, no community of kernel writers, no path to scale. It’s a specialised tool for a specific platform.

— think, then check —

The trade-off:

Triton is easier to write, easier to debug, and well-supported by the Inductor compiler. The compiler handles many optimisations automatically.

Raw CUDA gives full control but requires writing more code and managing more details. Specific low-level optimisations (warp specialisation, specific memory patterns, async copies) are easier in CUDA.

When Triton is enough:

For ~80% of new kernel work:

Element-wise operations (LayerNorm, GELU, RMS).
Simple matmuls.
Fused operations where the fusion is the main win.
Quick prototyping and iteration.
Kernels for less-popular precisions (fp4, fp8) where CUDA’s tooling is less mature.

FlashAttention 2’s Triton version achieved ~80% of the hand-tuned CUDA version’s performance with 1/10 the code. For most labs, this is the right trade-off.

When raw CUDA is needed:

For peak performance on specific hardware:

Cutting-edge optimisations (warp specialisation, async copies, specific PTX instructions).
Multi-stage pipelining that Triton’s compiler doesn’t optimise.
Hopper-specific features (TMA, distributed shared memory) that Triton doesn’t yet support.
Performance gaps that matter at billion-dollar training-run scale.

FlashAttention 3 (2024) targets H100 specifically and uses CUDA for the latest hardware features. The 20% performance gap over Triton matters when you’re training $100M models.

The pattern:

Research / prototyping: Triton (or torch.compile generating Triton).
Production / standard inference: Triton, hand-tuned where needed.
Frontier / pre-training: raw CUDA (FlashAttention 3, CUTLASS, custom kernels).

The hierarchy reflects the cost trade-off: Triton saves engineering time at the cost of last 10-20% performance. At small scale, save time. At frontier scale, that 20% is worth months of engineering.

↳ §22.2 + FlashAttention codebases

The kernel stack in practice

Where each kernel comes from for a typical Llama 3 inference: Operation Eager fallback torch.compile Hand-tuned matmul (large) cuBLAS cuBLAS or Triton cuBLAS or CUTLASS matmul (small) cuBLAS Triton (fused) Triton RMSNorm aten kernel Triton (fused) hand-tuned CUDA / Triton attention naive multi-kernel FlashAttention via Inductor FlashAttention 3 CUDA softmax aten kernel Triton (fused) hand-tuned embedding lookup aten kernel Triton rarely customised The "hand-tuned" column is what frontier labs ship. The "torch.compile" column is what most users get with @torch.compile. The "eager fallback" is what naive PyTorch code uses.

— think, then check —

Why Apple shipped MLX:

PyTorch + MPS (Metal Performance Shaders) backend works on Mac but with friction:

Performance: significantly worse than the same model on equivalent CUDA. Apple Silicon’s hardware can do more than what MPS exposes.
Reliability: some operations fall back to CPU (slow); others crash or produce wrong results.
Tooling: limited support for Apple-specific features (unified memory, ANE).
Roadmap: PyTorch’s MPS development is community-driven; Apple has limited influence over priorities.

Apple’s response: build a framework FROM SCRATCH for their hardware.

MLX’s niche:

On-device inference: the primary target. Run frontier LLMs locally on Mac without depending on PyTorch.
Apple Silicon-specific optimisations: tight integration with unified memory; uses Metal Performance Shaders; can dispatch to the Neural Engine.
Research at Apple: Apple’s internal ML research uses MLX; the framework is shaped by their use cases.
Open source: first major Apple ML framework that’s openly developed; community can contribute.

Vs PyTorch + MPS:

MLX is FASTER on Mac for typical inference workloads (2-3× in many benchmarks).
MLX is SIMPLER for Mac-specific use cases (unified memory, on-device).
MLX is more LIMITED in ecosystem: fewer libraries, fewer pre-trained models, less community knowledge.
MLX is Mac-only — no path to NVIDIA, no path to scale.

The realistic choice:

If you’re building a Mac-only app (e.g., a desktop LLM tool, an iOS app): MLX is becoming the right choice.

If you’re doing cross-platform research: PyTorch (with MPS for Mac development, CUDA for actual experiments).

If you’re at a frontier lab: NVIDIA / CUDA, full stop.

MLX is positioned as the “consumer / edge” ML framework for Apple’s ecosystem. It’s not competing with PyTorch for the cloud / research market, but it’s clearly the right tool for the on-device space.

↳ §22.2 + MLX docs

Next: §22.3 — ONNX, GGUF, safetensors. Model interchange formats and what a model file actually contains.