PyTorch (eager vs compiled), CUDA, ONNX, MLX. What a 'kernel' is, the dispatch stack. A real CUDA dot product alongside its CPU SIMD twin.
By default, PyTorch runs in eager mode: each Python operation executes immediately, typically launching one or more kernels as soon as it is called. torch.compile, introduced in PyTorch 2.0 (2023), adds a compiled mode in which the model is captured as a graph, optimised, and lowered to a smaller number of fused kernels. The "dispatch stack" is what routes an operation (e.g., torch.matmul on bf16 CUDA tensors) to the right kernel implementation. Understanding this stack is essential for debugging performance.
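To make the routing idea concrete, here is a minimal sketch of a dispatch table in plain Python. It is a toy under stated assumptions, not PyTorch's actual dispatcher: the names KERNELS, register, resolve, and dispatch are hypothetical, and the "cuda" kernel is a CPU stand-in so the example runs anywhere.

```python
from typing import Callable, Dict, Sequence, Tuple

# Hypothetical registry mapping (op name, device, dtype) to a kernel function.
Key = Tuple[str, str, str]
KERNELS: Dict[Key, Callable[..., float]] = {}


def register(op: str, device: str, dtype: str):
    """Decorator that files a kernel under its (op, device, dtype) key."""
    def wrap(fn):
        KERNELS[(op, device, dtype)] = fn
        return fn
    return wrap


@register("dot", "cpu", "float32")
def dot_cpu_f32(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


@register("dot", "cuda", "bfloat16")
def dot_cuda_bf16(a: Sequence[float], b: Sequence[float]) -> float:
    # Stand-in only: a real backend would launch a GPU kernel here.
    return sum(x * y for x, y in zip(a, b))


def resolve(op: str, device: str, dtype: str) -> Callable[..., float]:
    """Look up the kernel for this combination, or fail loudly."""
    try:
        return KERNELS[(op, device, dtype)]
    except KeyError:
        raise NotImplementedError(f"no kernel for {op} on {device}/{dtype}") from None


def dispatch(op: str, device: str, dtype: str, *args):
    """Eager-style execution: resolve the kernel and run it immediately."""
    return resolve(op, device, dtype)(*args)
```

Calling dispatch("dot", "cpu", "float32", a, b) runs dot_cpu_f32, while an unregistered combination raises NotImplementedError. A real dispatcher layers more concerns (autograd, tracing, backend selection) on top of the same lookup idea.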
A "kernel" is a function that runs on accelerator hardware (GPU, Neural Engine, etc.). CUDA is NVIDIA's C++ language extension for writing GPU kernels: fast but tedious. Triton (OpenAI, 2021) is a Python-embedded DSL alternative that compiles to GPU code (PTX on NVIDIA hardware) with much less boilerplate than hand-written CUDA. MLX is Apple's array framework for Apple Silicon. This section shows what a real CUDA dot-product kernel looks like, compares it to the Triton equivalent, and explains MLX's niche.
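As a rough illustration of the structure such a kernel commonly has, the sketch below simulates a dot product in plain Python. It is not CUDA and not the section's kernel; it only mimics the usual pattern of a grid-stride loop followed by a shared-memory tree reduction. The names dot_block and dot, and the default block_dim and grid_dim values, are assumptions made for this example.

```python
from typing import List, Sequence


def dot_block(a: Sequence[float], b: Sequence[float], partial: List[float],
              block_idx: int, block_dim: int, grid_dim: int) -> None:
    """Simulate one thread block; threads run sequentially here, not in parallel."""
    n = len(a)
    shared = [0.0] * block_dim  # stands in for CUDA __shared__ memory

    # Phase 1: each simulated thread accumulates a grid-stride slice of the input.
    for thread_idx in range(block_dim):
        i = block_idx * block_dim + thread_idx
        acc = 0.0
        while i < n:
            acc += a[i] * b[i]
            i += block_dim * grid_dim
        shared[thread_idx] = acc

    # A real kernel needs a barrier (__syncthreads()) here and after each step below.

    # Phase 2: tree reduction, halving the number of active threads each step.
    stride = block_dim // 2
    while stride > 0:
        for thread_idx in range(stride):
            shared[thread_idx] += shared[thread_idx + stride]
        stride //= 2

    partial[block_idx] = shared[0]  # one partial sum per block


def dot(a: Sequence[float], b: Sequence[float],
        block_dim: int = 8, grid_dim: int = 4) -> float:
    """Run every simulated block, then combine the per-block partial sums."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    if block_dim <= 0 or block_dim & (block_dim - 1):
        raise ValueError("block_dim must be a power of two for the tree reduction")
    partial = [0.0] * grid_dim
    for block_idx in range(grid_dim):
        dot_block(a, b, partial, block_idx, block_dim, grid_dim)
    return sum(partial)
```

On a GPU the per-thread loops run concurrently, which is why the barriers matter. The final sum over partial would typically be done on the host or with an atomic add.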
A model file contains weights plus metadata (architecture, dtypes, vocab). The three major formats are safetensors (the modern Hugging Face default), GGUF (the llama.cpp ecosystem, with built-in support for quantized weights), and ONNX (cross-framework, primarily for inference deployment). Each makes different design trade-offs; understanding them is essential for any production work.
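To show what "weights plus metadata" can look like on disk, here is a minimal sketch of the safetensors layout as published: an 8-byte little-endian header length, a JSON header giving each tensor's dtype, shape, and byte offsets, then the raw tensor bytes. The functions write_safetensors and read_safetensors are hypothetical helpers written for this example, not the safetensors library API, and they handle only F32 data.

```python
import json
import struct
from typing import Dict, List, Sequence, Tuple

Tensor = Tuple[Sequence[int], Sequence[float]]  # (shape, flat values)


def write_safetensors(tensors: Dict[str, Tensor]) -> bytes:
    """Serialise F32 tensors into a safetensors-style byte string."""
    header = {}
    payload = bytearray()
    for name, (shape, values) in tensors.items():
        start = len(payload)
        payload += struct.pack(f"<{len(values)}f", *values)  # little-endian float32
        header[name] = {
            "dtype": "F32",
            "shape": list(shape),
            "data_offsets": [start, len(payload)],  # relative to the data section
        }
    header_bytes = json.dumps(header, separators=(",", ":")).encode("utf-8")
    return struct.pack("<Q", len(header_bytes)) + header_bytes + bytes(payload)


def read_safetensors(blob: bytes) -> Dict[str, Tuple[List[int], List[float]]]:
    """Parse the header, then slice each tensor out of the data section."""
    (header_len,) = struct.unpack_from("<Q", blob, 0)
    header = json.loads(blob[8:8 + header_len].decode("utf-8"))
    data = blob[8 + header_len:]
    out = {}
    for name, info in header.items():
        if name == "__metadata__":  # optional free-form string metadata
            continue
        start, end = info["data_offsets"]
        count = (end - start) // 4  # 4 bytes per F32 element
        out[name] = (info["shape"], list(struct.unpack(f"<{count}f", data[start:end])))
    return out
```

Because the header is plain JSON and the payload is raw bytes, a reader can locate any tensor without executing code from the file. That is one of the design trade-offs separating this format from pickle-based checkpoints.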