ONNX, GGUF, safetensors — model interchange formats
A trained model is just a bag of numbers (weights) with structure (the architecture). Storing and shipping these efficiently is a deceptively hard problem: you need to support multiple precisions, multiple architectures, lazy loading, safety (no arbitrary-code execution at load time), and cross-framework portability. Three formats have emerged as the practical answers in 2025: safetensors (the modern default for Hugging Face / PyTorch), GGUF (llama.cpp’s format, with built-in support for quantization), and ONNX (cross-framework, primarily for inference deployment). This section walks through each format’s design and trade-offs.
safetensors — the modern Hugging Face default
The story: PyTorch’s original format was pickle-based. Pickle can execute arbitrary code at load time, so loading an unverified model could compromise your machine. In 2022, Hugging Face introduced safetensors as the safe alternative.
safetensors is now the default for nearly every new model on Hugging Face Hub. The format is intentionally simple: an 8-byte little-endian integer giving the header length, then a JSON header describing each tensor’s dtype, shape, and byte offsets, then the raw tensor data. It is mmap-friendly, trivial to parse, and safe.
The key safety property: the file format cannot contain executable code. The JSON header is interpreted as pure data; the tensor data is interpreted as raw bytes. There’s no analogue to pickle’s __reduce__ or __setstate__ hooks, which can run arbitrary code at load time.
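The layout described above is simple enough to write and read by hand. The following is a minimal sketch using only the standard library, not the safetensors library itself; the helper names (write_minimal_safetensors, read_header) are hypothetical, and the real library additionally handles details such as header padding and an optional metadata entry.

```python
import json
import struct


def write_minimal_safetensors(tensors):
    """Build a safetensors-style blob.

    tensors maps name -> (dtype string, shape list, raw little-endian bytes).
    """
    header = {}
    chunks = []
    offset = 0
    for name, (dtype, shape, raw) in tensors.items():
        # Offsets are relative to the start of the data section, not the file.
        header[name] = {
            "dtype": dtype,
            "shape": shape,
            "data_offsets": [offset, offset + len(raw)],
        }
        offset += len(raw)
        chunks.append(raw)
    header_bytes = json.dumps(header, separators=(",", ":")).encode("utf-8")
    # 8-byte little-endian header length, then JSON header, then raw data.
    return struct.pack("<Q", len(header_bytes)) + header_bytes + b"".join(chunks)


def read_header(blob):
    """Return (parsed JSON header, byte position where tensor data starts)."""
    (header_len,) = struct.unpack("<Q", blob[:8])
    header = json.loads(blob[8 : 8 + header_len])
    return header, 8 + header_len


raw = struct.pack("<4f", 1.0, 2.0, 3.0, 4.0)
blob = write_minimal_safetensors({"w": ("F32", [2, 2], raw)})

header, data_start = read_header(blob)
begin, end = header["w"]["data_offsets"]
values = struct.unpack("<4f", blob[data_start + begin : data_start + end])
```

Nothing in the reader does more than parse JSON and slice bytes, which is the whole point: there is no step at which file content could be treated as code.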
GGUF — llama.cpp’s format
GGUF (GGML Universal Format) is the format that ships with llama.cpp.
GGUF is the format you see on Hugging Face when downloading “Llama 3 8B Q4_K_M” or “Mistral 7B IQ4_XS.” It’s optimised for the use case of running quantized models on consumer hardware via llama.cpp.
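A GGUF file can be recognised from its first few bytes. The sketch below assumes the fixed header layout from version 3 of the GGUF specification (a 4-byte magic "GGUF", a uint32 version, a uint64 tensor count, and a uint64 metadata key-value count, all little-endian); that layout is not spelled out in this section, and the function name and the counts in the synthetic header are hypothetical.

```python
import struct

GGUF_MAGIC = b"GGUF"
HEADER_FORMAT = "<4sIQQ"  # magic, version, tensor_count, metadata_kv_count
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)  # 24 bytes


def read_gguf_header(blob):
    """Parse the fixed-size prefix of a GGUF file held in memory."""
    if len(blob) < HEADER_SIZE:
        raise ValueError("too short to be a GGUF file")
    magic, version, tensor_count, kv_count = struct.unpack_from(HEADER_FORMAT, blob, 0)
    if magic != GGUF_MAGIC:
        raise ValueError("not a GGUF file")
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": kv_count,
    }


# Synthetic header for illustration; the counts are made up.
fake = struct.pack(HEADER_FORMAT, GGUF_MAGIC, 3, 291, 25)
info = read_gguf_header(fake)
```

In a real file the metadata key-value pairs (architecture, tokenizer, quantization details) and the tensor descriptions follow this prefix, which is what lets a single file carry everything llama.cpp needs.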
The pickle vulnerability:
Python’s pickle serialisation format supports arbitrary Python object reconstruction via the __reduce__ and __setstate__ methods. When you call pickle.load, the deserialiser may execute Python code to reconstruct the object.
This is by design — it’s what makes pickle handle complex Python objects. But it also means a malicious .pt file can execute ARBITRARY CODE at load time:
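    import os
    import pickle

    class Exploit:
        # pickle records what __reduce__ returns; the callable runs at load time.
        def __reduce__(self):
            return (os.system, ("rm -rf ~/important_files",))

    with open("model.pt", "wb") as f:
        pickle.dump(Exploit(), f)

    # Anyone who later calls pickle.load(open("model.pt", "rb")) runs the rm -rf command.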
This isn’t theoretical: malicious models have been discovered on model hubs (rare but documented). The exploit works because most users download a model and load it without inspecting it, and the malicious code executes at that moment.
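The mechanism can be observed safely. In the sketch below, a stand-in class named Harmless (hypothetical) uses the same __reduce__ hook but with the built-in len as its payload, and pickletools lists the opcodes of a pickle stream without executing it. The set of opcodes treated as suspicious is an assumption for illustration; real PyTorch checkpoints legitimately contain some of these opcodes, so practical scanners rely on allowlists of known-safe callables rather than a blanket ban.

```python
import pickle
import pickletools


class Harmless:
    """Stand-in for a malicious object: its payload is just len("abc")."""

    def __reduce__(self):
        return (len, ("abc",))


# Opcodes that import names or call objects during unpickling.
SUSPICIOUS = {"GLOBAL", "STACK_GLOBAL", "REDUCE", "INST", "OBJ", "NEWOBJ", "NEWOBJ_EX", "BUILD"}


def suspicious_opcodes(data):
    """List call/import opcodes in a pickle stream without executing it."""
    names = {op.name for op, _arg, _pos in pickletools.genops(data)}
    return sorted(names & SUSPICIOUS)


payload = pickle.dumps(Harmless())
plain = pickle.dumps({"weights": [0.1, 0.2]})

# Unpickling calls len("abc"); the result is the integer, not a Harmless object.
loaded = pickle.loads(payload)

flagged = suspicious_opcodes(payload)
clean = suspicious_opcodes(plain)
```

Swapping len for os.system is all that separates this from the exploit above, which is why the opcode stream, not the file extension, is what a scanner has to look at.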
safetensors’ fix:
The file format is PURELY DATA. No code can be embedded. The “tensors” are described by JSON metadata (which is interpreted as data, not code) and raw byte arrays (also data).
Loading a safetensors file CANNOT execute code from the file. The worst a malicious safetensors file can do is be malformed (deserialiser throws an error) or contain wrong-shaped tensors (loaded model is garbage).
Why this matters at scale:
Hugging Face Hub hosts hundreds of thousands of models. Many are downloaded millions of times. A malicious model entering the supply chain could compromise countless machines.
safetensors removes this attack vector entirely. Hugging Face strongly prefers safetensors-format uploads and warns users about loading pickle-format models from untrusted sources.
For modern model authoring: ALWAYS save in safetensors format. PyTorch itself does not write safetensors files; instead of calling torch.save, pass the model’s state dict to safetensors.torch.save_file from the safetensors package.
ONNX — the cross-framework standard
ONNX (Open Neural Network Exchange, 2017) is the older, more abstract format.
ONNX was the dominant cross-framework format in the 2017-2022 era but has lost some ground for LLMs. The cross-framework portability matters less when one framework (PyTorch) dominates training and many production deployments use specialised inference engines (vLLM, TensorRT-LLM, llama.cpp) rather than ONNX Runtime.
What’s actually in a model file
Take Llama 3 70B as an example (the sizes below correspond to the 70B model; the 8B model is roughly a ninth of these). The Hugging Face repository contains:
safetensors (~140 GB, sharded across 30+ files):
Use when:
- Fine-tuning the model: need full precision for stable training.
- Running on a multi-GPU server with enough HBM: 2× H100 with the original weights gives best quality.
- Loading with transformers / PEFT / accelerate: PyTorch ecosystem default.
- Modifying the architecture or doing research: safetensors is editable, GGUF less so.
Cost: huge file size, requires HBM matching the bf16 size, multi-shard download.
GGUF (~42 GB at Q4_K_M, single file):
Use when:
- Local inference on consumer hardware (RTX 4090, Apple Silicon, AMD GPUs via llama.cpp).
- llama.cpp / Ollama / LM Studio ecosystem.
- Distribution to end users (1 file vs 30 files makes deployment simpler).
- You’re OK with a perplexity increase of ~0.1-0.2 from quantization.
Cost: quantized so quality is slightly degraded; not the format for fine-tuning.
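Both file sizes quoted above follow from parameter count times bits per weight. A rough check, assuming about 70.6 billion parameters (the count that reproduces ~140 GB at 16 bits) and an average of about 4.8 bits per weight for Q4_K_M; both numbers are approximations introduced here, and real files also carry metadata and keep some tensors at higher precision.

```python
def weight_file_size_gb(n_params, bits_per_weight):
    """Approximate size of the weights alone, in decimal gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 70.6e9      # assumed parameter count
BF16_BITS = 16         # full-precision checkpoint
Q4_K_M_BITS = 4.8      # assumed average bits per weight for Q4_K_M

bf16_gb = weight_file_size_gb(N_PARAMS, BF16_BITS)
q4_gb = weight_file_size_gb(N_PARAMS, Q4_K_M_BITS)
ratio = bf16_gb / q4_gb
```

With these assumptions the bf16 checkpoint comes out near 141 GB and the Q4_K_M file near 42 GB, a reduction of a bit more than 3×, which is what moves the model from a multi-GPU server to a single high-memory workstation.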
The typical workflow:
- Model is trained at high precision → saved in safetensors.
- For production inference at scale: serve safetensors via vLLM or TensorRT-LLM on H100/A100 GPUs.
- For local / consumer / Mac deployment: quantize safetensors → GGUF → distribute via llama.cpp ecosystem.
- For Hugging Face Hub: both formats hosted; users pick based on their setup.
Special cases:
- If you’re running on Apple Silicon: GGUF works via llama.cpp (Metal backend). MLX has its own format derived from safetensors. Both are options.
- If you’re doing extremely-low-bit deployment (1.5-2 bpw): IQ-quants are GGUF-only. safetensors doesn’t have an equivalent.
The picture: safetensors is the “source of truth” format; GGUF is the “consumer distribution” format. Most production stacks use one or the other depending on the inference target.
Loading the model — what actually happens
The mmap-based loading is what makes safetensors so fast. Compare to pickle: each tensor must be parsed from a Python object stream, allocated freshly, and copied. safetensors’ raw bytes can be mapped directly into PyTorch tensors.
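The idea can be illustrated with the standard library alone. This sketch writes a toy file in the safetensors layout, memory-maps it, and exposes one tensor as a typed view over the mapped bytes without copying them. The function names are hypothetical, the cast assumes a little-endian host, and the real library hands the mapped region to PyTorch rather than to a memoryview.

```python
import json
import mmap
import os
import struct
import tempfile


def write_toy_file(path):
    """Write one float32 tensor "w" in the safetensors layout."""
    raw = struct.pack("<4f", 0.5, 1.5, 2.5, 3.5)
    header = json.dumps(
        {"w": {"dtype": "F32", "shape": [4], "data_offsets": [0, len(raw)]}}
    ).encode("utf-8")
    with open(path, "wb") as f:
        f.write(struct.pack("<Q", len(header)) + header + raw)


def map_tensor(path, name):
    """Return (mapping, float view) for one tensor; the view copies nothing."""
    with open(path, "rb") as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    (header_len,) = struct.unpack("<Q", mm[:8])
    header = json.loads(mm[8 : 8 + header_len])
    begin, end = header[name]["data_offsets"]
    data_start = 8 + header_len
    # Reinterpret the mapped bytes as native float32 values.
    view = memoryview(mm)[data_start + begin : data_start + end].cast("f")
    return mm, view


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.safetensors")
    write_toy_file(path)
    mm, view = map_tensor(path, "w")
    values = view.tolist()  # pages are read from disk only when touched
    view.release()
    mm.close()
```

Only the header is parsed eagerly. The tensor bytes stay on disk until something reads them, which is why opening a multi-gigabyte shard this way is nearly instant.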
Where ONNX still wins:
- Vision models in production: classification, detection, segmentation models trained in PyTorch are routinely exported to ONNX and run via ONNX Runtime in production. Stable, mature, fast.
- Edge devices: ONNX Runtime has lightweight backends for many embedded targets (ARM CPUs, mobile NPUs, etc.). Standard PyTorch can’t easily target these.
- Conversion to specialised runtimes: TensorRT (NVIDIA optimisation), OpenVINO (Intel), CoreML (Apple) all accept ONNX as input. Common deployment pattern: train PyTorch → export ONNX → convert to runtime-specific format.
- Microsoft ML ecosystem: ONNX Runtime is heavily used in Azure / .NET / Microsoft stack.
What makes LLMs hard for ONNX:
- Dynamic shapes: LLM inputs are variable-length sequences. ONNX supports dynamic shapes but with overhead; many ONNX runtimes optimise better for static shapes. LLM serving also uses variable batch sizes (continuous batching), which makes fixed static shapes impractical.
- KV cache management: the KV cache grows as tokens are generated; managing this efficiently requires runtime support for “stateful” inference. ONNX is designed for stateless models; adding cache state requires custom op extensions.
- Sampling and search: generation involves sampling logits, then re-running with the new token. ONNX models traditionally do single-shot inference; loop-with-state is awkward.
- Modern ops: FlashAttention, RoPE, RMSNorm, SwiGLU all need ONNX equivalents. The opset has been catching up but lags PyTorch.
- Quantization formats: LLM-specific quantization (q4_K, IQ-quants) has no ONNX equivalent. ONNX has int8 / int4 quantization but with different semantics.
- Performance: for LLM inference, specialised engines (vLLM, TensorRT-LLM, llama.cpp) typically outperform ONNX Runtime, by roughly 2-5× depending on model and hardware. ONNX Runtime has been slower to adopt LLM-specific optimisations.
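The KV cache and sampling points above come down to where the state lives. In a stateless graph, the cache has to be passed in as an input and returned as an output on every call, and the generation loop runs outside the model. The toy sketch below shows only that shape; stateless_step is a hypothetical stand-in for a forward pass, the "cache" is a tuple of past tokens, and no real ONNX API is used.

```python
VOCAB = 7


def stateless_step(token, past):
    """One forward pass with no hidden state.

    The cache comes in as an argument and the grown cache goes back out.
    """
    new_past = past + (token,)  # the cache grows by one entry per token
    target = sum(new_past) % VOCAB
    # Toy scores that peak at `target`; a real model would return logits.
    scores = [-abs(i - target) for i in range(VOCAB)]
    return scores, new_past


def generate(prompt, max_new_tokens):
    """Greedy decoding loop that threads the cache through every call."""
    past = ()
    scores = None
    for token in prompt:
        scores, past = stateless_step(token, past)
    output = []
    for _ in range(max_new_tokens):
        next_token = max(range(VOCAB), key=scores.__getitem__)
        output.append(next_token)
        scores, past = stateless_step(next_token, past)
    return output, past


tokens, cache = generate([1, 2], 3)
```

Every call ships the whole cache across the model boundary, and the loop, the sampling rule, and the cache bookkeeping all sit in host code. Dedicated LLM engines instead keep the cache inside the runtime and update it in place, which is part of the gap described above.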
The practical state:
For LLMs, ONNX is largely bypassed. The pipeline is usually: PyTorch (training) → safetensors (storage) → vLLM / TensorRT-LLM / llama.cpp (inference). ONNX as an intermediate step adds friction without benefit.
For vision and traditional ML: ONNX remains the default deployment format. The CNN ecosystem hasn’t moved away.
For edge deployment: ONNX has competition from CoreML (Apple), TFLite (Google), and MediaPipe (mobile). ONNX is still relevant but not dominant.
The ONNX story: a successful interchange format for the 2018-era ML world, partially displaced by LLM-specific tooling but still valuable for cross-platform deployment of “classical” models.
END OF CH.22 — Runtimes and frameworks.
§1 (PyTorch dispatch stack: eager vs compiled, torch.compile speedups via fusion) ·
§2 (CUDA kernel structure, Triton DSL, MLX for Apple Silicon) ·
§3 (model formats: safetensors as default, GGUF for distribution, ONNX for cross-platform).
Next: Ch.23 — Inference at scale. KV cache, PagedAttention, continuous batching, vLLM internals.