Runtimes & frameworks

§1 PyTorch eager vs compiled — the dispatch stack
By default PyTorch runs in eager mode: each Python-level operation is dispatched as soon as it executes, and on a GPU that usually means launching one or more kernels per operation. torch.compile, introduced in PyTorch 2.0 (2023), adds a compiled mode in which the model is traced into a graph, optimised, and lowered to fused kernels, so several operations can share a single launch. The "dispatch stack" is what routes an operation (e.g., torch.matmul on bf16 CUDA tensors) to the right kernel implementation. Knowing how that routing works makes it much easier to see where time goes when debugging performance.
§2 CUDA, Triton, MLX — what a 'kernel' is
A "kernel" is a function that runs on accelerator hardware (GPU, Neural Engine, etc.). CUDA is NVIDIA's C++ language extension for writing GPU kernels: fast but tedious. Triton (released by OpenAI in 2021) is a Python-embedded DSL whose compiler generates GPU code directly (PTX on NVIDIA hardware) rather than CUDA C++ source, with much less boilerplate. MLX is Apple's array framework for Apple Silicon. This section shows what a real CUDA dot-product kernel looks like, compares it to the Triton equivalent, and explains MLX's niche.
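Before reading actual CUDA or Triton source, it can help to see the work decomposition a dot-product kernel typically uses. The sketch below is a pure-Python emulation, not GPU code: the names `dot_kernel_block` and `dot` and the block size of 4 are assumptions made for illustration, and the "threads" run sequentially in a loop rather than in parallel.

```python
def dot_kernel_block(a, b, block_idx, block_dim):
    """Emulate one thread block: per-thread products, then a tree reduction."""
    # Stand-in for shared memory: one slot per thread in the block.
    shared = [0.0] * block_dim
    for thread_idx in range(block_dim):
        i = block_idx * block_dim + thread_idx  # global element index
        if i < len(a):  # bounds guard, needed for the last partial block
            shared[thread_idx] = a[i] * b[i]
    # Tree reduction: halve the number of active threads at each step.
    # On a real GPU a barrier between steps keeps threads in sync.
    stride = block_dim // 2
    while stride > 0:
        for thread_idx in range(stride):
            shared[thread_idx] += shared[thread_idx + stride]
        stride //= 2
    return shared[0]


def dot(a, b, block_dim=4):
    """Cover the input with enough blocks and sum the per-block partials."""
    assert len(a) == len(b), "vectors must have equal length"
    assert block_dim > 0 and block_dim & (block_dim - 1) == 0, "block_dim must be a power of two"
    grid_dim = (len(a) + block_dim - 1) // block_dim  # ceiling division
    return sum(dot_kernel_block(a, b, blk, block_dim) for blk in range(grid_dim))


a = [1.0, 2.0, 3.0, 4.0, 5.0]
b = [10.0, 20.0, 30.0, 40.0, 50.0]
total = dot(a, b)
print(total)
```

The same three pieces (index computation with a bounds guard, a per-block reduction, and a final combine across blocks) are what the CUDA and Triton versions express, with the loops over `thread_idx` replaced by hardware parallelism.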
§3 ONNX, GGUF, safetensors — model interchange formats
A model file stores weights plus some amount of metadata: dtypes and shapes at minimum, and in some formats also architecture details or the tokenizer vocabulary. Three major formats: safetensors (the modern Hugging Face default), GGUF (llama.cpp ecosystem, supports quantized weights), and ONNX (cross-framework, primarily for inference deployment). Each makes different design trade-offs, and those trade-offs determine which format fits a given deployment.
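To make "weights plus metadata" concrete, here is a minimal sketch of the safetensors layout using only the standard library: an 8-byte little-endian header length, a JSON header describing each tensor (dtype, shape, byte offsets), then the raw tensor bytes. The helper names, the tensor name "weight", and the file name are hypothetical, and the sketch handles a single one-dimensional float32 tensor; for real files the safetensors library is the right tool.

```python
import json
import os
import struct
import tempfile


def write_one_f32_tensor(path, name, values):
    """Write a single 1-D float32 tensor in a safetensors-style layout."""
    data = struct.pack(f"<{len(values)}f", *values)  # little-endian float32
    header = {
        name: {"dtype": "F32", "shape": [len(values)], "data_offsets": [0, len(data)]}
    }
    header_bytes = json.dumps(header, separators=(",", ":")).encode("utf-8")
    with open(path, "wb") as f:
        f.write(struct.pack("<Q", len(header_bytes)))  # 8-byte header length
        f.write(header_bytes)                          # JSON header
        f.write(data)                                  # raw tensor bytes


def read_header(path):
    """Return the parsed JSON header and the file offset where tensor data starts."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))
    return header, 8 + header_len


def read_f32_tensor(path, name):
    """Read one float32 tensor by seeking to its byte range."""
    header, data_start = read_header(path)
    begin, end = header[name]["data_offsets"]  # offsets are relative to data_start
    with open(path, "rb") as f:
        f.seek(data_start + begin)
        raw = f.read(end - begin)
    return list(struct.unpack(f"<{(end - begin) // 4}f", raw))


tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "toy.safetensors")
write_one_f32_tensor(path, "weight", [1.0, 2.5, -3.0])
header, data_start = read_header(path)
values = read_f32_tensor(path, "weight")
print(header, values)
```

Because the header is plain JSON and the payload is raw bytes, a reader can inspect every tensor's dtype and shape without loading any weights, and can seek straight to a single tensor. GGUF and ONNX carry more inside the file (tokenizer and architecture metadata, or a full computation graph), which is where their trade-offs differ.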