ONNX, GGUF, safetensors — model interchange formats

Section 22.3

A trained model is just a bag of numbers (weights) with structure (the architecture). Storing and shipping these efficiently is a deceptively hard problem: you need to support multiple precisions, multiple architectures, lazy loading, safety (no arbitrary-code execution at load time), and cross-framework portability. Three formats have emerged as the practical answers in 2025: safetensors (the modern default for Hugging Face / PyTorch), GGUF (llama.cpp’s format, with built-in quantization support), and ONNX (cross-framework, primarily for inference deployment). This section walks through each format’s design and its trade-offs.

safetensors — the modern Hugging Face default

The story: PyTorch’s original format was pickle-based. Pickle can execute arbitrary code at load time, so loading an unverified model could compromise your machine. In 2022, Hugging Face introduced safetensors as the safe alternative.

safetensors file layout:

  [8 bytes]  header_size (little-endian uint64)
  [N bytes]  header (JSON, valid UTF-8)
  [...]      raw tensor data (concatenated)

Header format (JSON):

  {
    "model.embed_tokens.weight": {
      "dtype": "BF16",
      "shape": [32000, 4096],
      "data_offsets": [0, 262144000]
    },
    "model.layers.0.self_attn.q_proj.weight": {
      "dtype": "BF16",
      "shape": [4096, 4096],
      "data_offsets": [262144000, 295698432]
    },
    ...
  }

Properties:

- Zero-copy: tensors can be mmap'd directly from disk (no parsing needed for data).
- Safe: header is pure JSON, no code execution.
- Lazy-loadable: each tensor's offset is known; load only what you need.
- Cross-framework: PyTorch, TensorFlow, and JAX can all read it.
- Sharded: large models split into multiple .safetensors files + index file.

safetensors model format: a model serialisation format introduced by Hugging Face in 2022 as a safe replacement for PyTorch's pickle-based format. Structure: 8-byte header size + JSON header + raw tensor data. The JSON header lists each tensor's dtype, shape, and data offset. The raw tensor data can be memory-mapped directly without parsing. Safe (no code execution at load), fast (zero-copy load), cross-framework. The de facto default for Hugging Face Hub since 2023; almost all new models ship in safetensors.

safetensors is the default for nearly every new model on Hugging Face Hub. The format is intentionally simple: 8 bytes for the header length, then a JSON header describing each tensor’s layout, then raw tensor data. It is mmap-friendly, parser-free, and safe.

The key safety property: the file format cannot contain executable code. The JSON header is interpreted as pure data; the tensor data is interpreted as raw bytes. There’s no analogue to pickle’s __reduce__ or __setstate__ that could execute arbitrary code at load time.
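The layout above is simple enough to exercise with the standard library alone. The following is a minimal sketch, not the safetensors library itself: `pack_header`, `parse_header`, `entry_size_ok`, and the `DTYPE_BYTES` table are names made up for illustration, and the sketch ignores details the real library handles (such as header padding and stricter validation). It round-trips the two example entries and checks that each `data_offsets` span equals shape × bytes per element.

```python
import json
import struct
from math import prod

# Bytes per element for a few dtypes (illustrative subset, not the full list).
DTYPE_BYTES = {"BF16": 2, "F16": 2, "F32": 4}

def pack_header(entries):
    """Serialise entries as: 8-byte little-endian length + UTF-8 JSON."""
    body = json.dumps(entries).encode("utf-8")
    return struct.pack("<Q", len(body)) + body

def parse_header(blob):
    """Read the 8-byte length, then decode exactly that many bytes as JSON."""
    (header_size,) = struct.unpack_from("<Q", blob, 0)
    return json.loads(blob[8:8 + header_size].decode("utf-8"))

def entry_size_ok(entry):
    """True if the byte span matches the number of elements times element size."""
    begin, end = entry["data_offsets"]
    return end - begin == prod(entry["shape"]) * DTYPE_BYTES[entry["dtype"]]

entries = {
    "model.embed_tokens.weight": {
        "dtype": "BF16", "shape": [32000, 4096],
        "data_offsets": [0, 262144000],
    },
    "model.layers.0.self_attn.q_proj.weight": {
        "dtype": "BF16", "shape": [4096, 4096],
        "data_offsets": [262144000, 295698432],
    },
}

header = parse_header(pack_header(entries))
sizes_ok = all(entry_size_ok(e) for e in header.values())
```

Nothing in `parse_header` can run code from the file: it reads an integer and decodes JSON, which is the whole point of the design.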

GGUF — llama.cpp’s format

GGUF (GGML Universal Format) is the format that ships with llama.cpp:

GGUF file layout (simplified):

  [4 bytes]  magic ("GGUF")
  [4 bytes]  version
  [8 bytes]  tensor count
  [8 bytes]  metadata kv count
  [...]      metadata (key-value pairs: architecture, vocab, etc.)
  [...]      tensor info (name, shape, dtype, offset)
  [padding]  alignment to 32-byte boundary
  [...]      raw tensor data

Differences from safetensors:

- Embedded metadata: architecture, hyperparameters, tokenizer all in one file.
- Native quantization support: dtype can be Q4_0, Q5_K, Q6_K, etc. (Ch.24).
- Single-file: a 70B model in GGUF is ONE file (vs safetensors which is sharded).
- llama.cpp-specific: designed for the llama.cpp inference engine.

Why single-file matters for distribution:

- Direct download.
- Direct mmap for inference.
- Hash-verified integrity.
- Hugging Face Hub treats GGUF files as first-class.

GGUF model format: a binary model format designed for llama.cpp's inference engine (2023). Features: single-file (no sharding), embedded architecture + tokenizer metadata, native support for GGML quantization formats (q4_0 through q6_K, IQ-quants). Designed for direct mmap and zero-allocation inference. Predominantly used by the llama.cpp / Ollama / LM Studio ecosystem. Most quantized models on Hugging Face are distributed as GGUF; non-quantized models more commonly as safetensors.

GGUF is the format you see on Hugging Face when downloading “Llama 3 8B Q4_K_M” or “Mistral 7B IQ4_XS.” It’s optimised for running quantized models on consumer hardware via llama.cpp.
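The fixed-size start of the GGUF layout can be decoded with a single `struct` format string. This is a sketch under stated assumptions: it covers only the first four fields of the simplified layout above (assuming little-endian byte order), it does not parse the metadata or tensor-info sections, and the header bytes are fabricated in memory with illustrative values rather than read from a real model file. The function names are made up for this example.

```python
import struct

# magic (4 bytes), version (uint32), tensor count (uint64), metadata kv count (uint64)
GGUF_FIXED_HEADER = struct.Struct("<4sIQQ")

def parse_gguf_fixed_header(blob):
    """Decode the fixed-size fields at the start of a GGUF file."""
    magic, version, tensor_count, kv_count = GGUF_FIXED_HEADER.unpack_from(blob, 0)
    if magic != b"GGUF":
        raise ValueError(f"not a GGUF file: magic={magic!r}")
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": kv_count,
    }

def align_up(offset, alignment=32):
    """Round offset up to the next multiple of alignment (the padding step)."""
    return (offset + alignment - 1) // alignment * alignment

# Fabricated header bytes with illustrative values, not taken from a real file.
fake_header = GGUF_FIXED_HEADER.pack(b"GGUF", 3, 291, 24)
info = parse_gguf_fixed_header(fake_header)
data_start = align_up(1000)  # where tensor data would begin if tensor info ended at byte 1000
```

The alignment helper shows why the padding exists: tensor data starts on a 32-byte boundary so it can be mapped and read without misaligned accesses.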

The pickle vulnerability:

Python’s pickle serialisation format supports arbitrary Python object reconstruction via the __reduce__ and __setstate__ methods. When you call pickle.load, the deserialiser may execute Python code to reconstruct the object.

This is by design — it’s what makes pickle handle complex Python objects. But it also means a malicious .pt file can execute ARBITRARY CODE at load time:

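    import os
    import pickle

    class Exploit:
        # pickle calls __reduce__ to learn how to rebuild the object;
        # returning (callable, args) makes the loader call callable(*args).
        def __reduce__(self):
            return (os.system, ("rm -rf ~/important_files",))

    with open("model.pt", "wb") as f:
        pickle.dump(Exploit(), f)

    # Anyone who later runs pickle.load(open("model.pt", "rb")) executes the rm -rf command.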

This isn’t theoretical: malicious models have been discovered on model hubs (rare but documented). The exploitation works because most users download models without inspecting them and “load and run” — at which point the malicious code executes.
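One way to see the problem without triggering it is to walk the pickle opcode stream with the standard library's `pickletools`, which decodes opcodes but never executes them. The sketch below is an illustration, not a production scanner: the opcode set and the function name `code_capable_opcodes` are choices made for this example, and the payload uses a harmless `echo` command that is never run because nothing is unpickled.

```python
import os
import pickle
import pickletools

# Opcodes that import a callable or invoke one during unpickling.
CODE_CAPABLE = {"GLOBAL", "STACK_GLOBAL", "REDUCE", "INST", "OBJ", "NEWOBJ", "NEWOBJ_EX"}

def code_capable_opcodes(data):
    """List the code-capable opcodes in a pickle stream without loading it."""
    return [op.name for op, _arg, _pos in pickletools.genops(data) if op.name in CODE_CAPABLE]

class Exploit:
    def __reduce__(self):
        # Harmless stand-in payload; it would only run if the bytes were unpickled.
        return (os.system, ("echo pwned",))

malicious = pickle.dumps(Exploit())                          # serialising does not run the payload
benign = pickle.dumps({"weights": [0.1, 0.2], "step": 3})    # plain containers and numbers

flagged_malicious = code_capable_opcodes(malicious)
flagged_benign = code_capable_opcodes(benign)
```

Note that legitimate PyTorch checkpoints also use these opcodes to rebuild tensors, so a practical scanner has to compare the imported names against an allowlist rather than reject every hit.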

safetensors’ fix:

The file format is PURELY DATA. No code can be embedded. The “tensors” are described by JSON metadata (which is interpreted as data, not code) and raw byte arrays (also data).

Loading a safetensors file CANNOT execute code from the file. The worst a malicious safetensors file can do is be malformed (deserialiser throws an error) or contain wrong-shaped tensors (loaded model is garbage).

Why this matters at scale:

Hugging Face Hub hosts hundreds of thousands of models. Many are downloaded millions of times. A malicious model entering the supply chain could compromise countless machines.

safetensors removes this attack vector entirely. Hugging Face strongly prefers safetensors-format uploads and warns users about loading pickle-format models from untrusted sources.

For modern model authoring: ALWAYS save in safetensors format. The safetensors library ships a PyTorch binding: instead of torch.save, call safetensors.torch.save_file.

ONNX — the cross-framework standard

ONNX (Open Neural Network Exchange, 2017) is the older, more abstract format:

ONNX file: a Protocol Buffer serialisation of a "model graph."

Contains:

- Model architecture as a directed graph of OPS (nodes) and TENSORS (edges).
- Each op is from a standardised opset (matmul, relu, layernorm, etc.).
- Initialisers (the actual weights) are tensors with constant values.
- Optional metadata: producer (which framework saved it), version, etc.

Key design decision: stores the COMPUTATION GRAPH, not just weights.

- You can load an ONNX model into ANY framework that supports the ops used.
- The model is self-describing: the ops define how to run it.

Use cases:

- Cross-framework portability: train in PyTorch, deploy with ONNX Runtime.
- Edge deployment: ONNX Runtime is lightweight and runs on many hardware targets.
- Conversion to other formats: e.g., ONNX → TensorRT for NVIDIA-optimised inference.

Limitations:

- Op coverage: ONNX has a slow opset evolution; new PyTorch ops may not have ONNX equivalents.
- LLM-specific challenges: dynamic shapes, KV cache, sampling all need careful handling.
- Less common for cutting-edge LLM work; more common for "deploy this trained model."

ONNX model format: a cross-framework model interchange format introduced in 2017 by Facebook and Microsoft. Stores the model as a computation graph (not just weights) using Protocol Buffers. Supports standardised operators (matmul, conv, attention, etc.) that any compatible runtime can execute. Used for deployment scenarios where the training framework differs from the inference runtime: train in PyTorch, deploy via ONNX Runtime, or convert to TensorRT for NVIDIA-optimised inference. Less common for LLMs than for traditional vision models, due to LLM-specific challenges (KV cache, dynamic shapes, sampling).

ONNX was the dominant cross-framework format in the 2017-2022 era but has lost some ground for LLMs. The cross-framework portability matters less when one framework (PyTorch) dominates training and many production deployments use specialised inference engines (vLLM, TensorRT-LLM, llama.cpp) rather than ONNX Runtime.
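To make "stores the computation graph, not just weights" concrete, here is a toy interpreter in plain Python. It is not the ONNX protobuf schema or any real runtime: the graph is an ordinary dict whose keys (`node`, `initializer`, `op_type`, `input`, `output`) only borrow ONNX's field names, the two operators are hand-written on nested lists, and the weights are invented.

```python
def matmul(a, b):
    """Matrix product of two nested-list matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def relu(a):
    """Element-wise max(0, x)."""
    return [[max(0.0, x) for x in row] for row in a]

# The "opset": operator names mapped to implementations.
OPS = {"MatMul": matmul, "Relu": relu}

# A graph is data: weights (initializers) plus an ordered list of nodes.
graph = {
    "initializer": {"W": [[1.0, -1.0], [0.5, -2.0]]},
    "input": ["x"],
    "node": [
        {"op_type": "MatMul", "input": ["x", "W"], "output": ["h"]},
        {"op_type": "Relu", "input": ["h"], "output": ["y"]},
    ],
    "output": ["y"],
}

def run(graph, feeds):
    """Execute nodes in listed order, looking each op up by name."""
    values = dict(graph["initializer"])
    values.update(feeds)
    for node in graph["node"]:
        args = [values[name] for name in node["input"]]
        values[node["output"][0]] = OPS[node["op_type"]](*args)
    return [values[name] for name in graph["output"]]

result = run(graph, {"x": [[2.0, 1.0]]})
```

Any runtime that implements the same operator names can execute the same graph, which is the portability argument; the flip side is that a graph using an operator missing from `OPS` cannot run at all, which mirrors the op-coverage limitation.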

What’s actually in a model file

Take Llama 3 8B as an example. The Hugging Face repository contains:

Llama-3-8B-Instruct repository contents:

  model-00001-of-00004.safetensors   ~4.9 GB   (weights, shard 1)
  model-00002-of-00004.safetensors   ~4.9 GB   (weights, shard 2)
  model-00003-of-00004.safetensors   ~4.9 GB   (weights, shard 3)
  model-00004-of-00004.safetensors   ~1.2 GB   (weights, shard 4)
  model.safetensors.index.json       ~30 KB    (tensor name → shard map)
  config.json                        ~1 KB     (architecture params)
  generation_config.json             ~200 B    (sampling defaults)
  tokenizer.json                     ~9 MB     (tokenizer vocab + merges)
  tokenizer_config.json              ~200 B    (special tokens)
  special_tokens_map.json            ~600 B    (BOS, EOS, etc.)
  README.md                          ~10 KB

Total: ~16 GB for the 8B model in bf16.

For comparison, GGUF quantized:

  Meta-Llama-3-8B-Instruct.Q4_K_M.gguf   ~4.9 GB   (single file)

Same model in q4_K_M: roughly 1/3 the size, in a single file. This is why GGUF is so popular for distributing consumer-runnable models.
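The sizes in the listing follow from simple arithmetic: parameters × bits per weight ÷ 8. The sketch below is an estimate only. The parameter count is the rounded published figure for Llama 3 8B, and the 4.85 bits per weight used for Q4_K_M is an assumed average (the real figure varies per tensor); the function name is made up for this example.

```python
def weights_size_gb(n_params, bits_per_weight):
    """Approximate weight storage in GB (10^9 bytes), ignoring metadata."""
    return n_params * bits_per_weight / 8 / 1e9

LLAMA3_8B_PARAMS = 8.03e9   # rounded parameter count
Q4_K_M_BPW = 4.85           # assumed average bits per weight for Q4_K_M

bf16_gb = weights_size_gb(LLAMA3_8B_PARAMS, 16)        # about 16 GB
q4_gb = weights_size_gb(LLAMA3_8B_PARAMS, Q4_K_M_BPW)  # about 4.9 GB
ratio = q4_gb / bf16_gb                                # about 0.3
```

The same function with a 70B-class parameter count gives roughly 140 GB in bf16 and a little over 40 GB at Q4_K_M, which is where the gap between server-class and consumer-class deployments comes from.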

safetensors (~140 GB in bf16 for a 70B-parameter model, sharded across 30+ files):

Use when:

Fine-tuning the model: need full precision for stable training.
Running on a multi-GPU server with enough HBM: 2× H100 with the original weights gives best quality.
Loading with transformers / PEFT / accelerate: PyTorch ecosystem default.
Modifying the architecture or doing research: safetensors is editable, GGUF less so.

Cost: huge file size, requires HBM matching the bf16 size, multi-shard download.

GGUF (~42 GB at Q4_K_M for a 70B-parameter model, single file):

Use when:

Local inference on consumer hardware (RTX 4090, Apple Silicon, AMD GPUs via llama.cpp).
llama.cpp / Ollama / LM Studio ecosystem.
Distribution to end users (1 file vs 30 files makes deployment simpler).
You’re OK with a perplexity increase of roughly 0.1-0.2 from quantization.

Cost: quantized so quality is slightly degraded; not the format for fine-tuning.

The typical workflow:

Model is trained at high precision → saved in safetensors.
For production inference at scale: serve safetensors via vLLM or TensorRT-LLM on H100/A100 GPUs.
For local / consumer / Mac deployment: quantize safetensors → GGUF → distribute via llama.cpp ecosystem.
For Hugging Face Hub: both formats hosted; users pick based on their setup.

Special cases:

If you’re running on Apple Silicon: GGUF works via llama.cpp (Metal backend). MLX has its own format derived from safetensors. Both are options.
If you’re doing extremely-low-bit deployment (1.5-2 bpw): IQ-quants are GGUF-only. safetensors doesn’t have an equivalent.

The picture: safetensors is the “source of truth” format; GGUF is the “consumer distribution” format. Most production stacks use one or the other depending on the inference target.

Loading the model — what actually happens

    from transformers import AutoModelForCausalLM
    model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

What happens (rough sequence):

1. Download config.json → know architecture (LlamaForCausalLM).
2. Instantiate the model in MEMORY (random weights, no actual values yet).
3. Find the safetensors index → list of tensor names + shards.
4. For each tensor:
   - Open the relevant shard.
   - mmap the bytes corresponding to the tensor.
   - Load them directly into the corresponding model parameter (no intermediate copy).
5. Put the model in eval mode (from_pretrained does this by default).
6. Return the loaded model.

Total time: 5-30 seconds for 8B model (mostly bandwidth-limited disk read).
RAM peak: ~equal to model size (mmap-managed).
GPU transfer: separate step via .to('cuda'), takes 1-5 seconds for 8B model.
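Step 3 of the sequence is a plain dictionary lookup. The sketch below assumes the index file has the usual shape of a `weight_map` object mapping tensor names to shard filenames; the three entries and the `total_size` value are invented for illustration, and `tensors_by_shard` is a hypothetical helper, not a transformers function.

```python
import json
from collections import defaultdict

# A tiny stand-in for model.safetensors.index.json (entries are illustrative).
index_text = """
{
  "metadata": {"total_size": 123},
  "weight_map": {
    "model.embed_tokens.weight": "model-00001-of-00004.safetensors",
    "model.layers.0.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
    "lm_head.weight": "model-00004-of-00004.safetensors"
  }
}
"""

def tensors_by_shard(index):
    """Group tensor names by the shard file that stores them."""
    groups = defaultdict(list)
    for tensor_name, shard_file in index["weight_map"].items():
        groups[shard_file].append(tensor_name)
    return dict(groups)

index = json.loads(index_text)
groups = tensors_by_shard(index)
shard_for_lm_head = index["weight_map"]["lm_head.weight"]
```

Grouping by shard first means each shard file is opened and mapped once, rather than once per tensor.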

The mmap-based loading is what makes safetensors so fast. Compare to pickle: each tensor must be parsed from a Python object stream, allocated freshly, and copied. safetensors’ raw bytes can be mapped directly into PyTorch tensors.
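The lazy-loading idea can be reproduced with `mmap` and `struct` from the standard library. This is a sketch, not what the safetensors library does internally: it writes a tiny two-tensor file in the layout described earlier (F32 values, made-up tensor names `a` and `b`), then reads only tensor `b` by seeking straight to its byte range. The helper names are hypothetical, and little-endian float32 storage is assumed.

```python
import json
import mmap
import os
import struct
import tempfile

def write_demo_file(path):
    """Write header size + JSON header + raw little-endian float32 data."""
    a = struct.pack("<4f", 1.0, 2.0, 3.0, 4.0)
    b = struct.pack("<2f", 5.0, 6.0)
    header = {
        "a": {"dtype": "F32", "shape": [2, 2], "data_offsets": [0, 16]},
        "b": {"dtype": "F32", "shape": [2], "data_offsets": [16, 24]},
    }
    header_bytes = json.dumps(header).encode("utf-8")
    with open(path, "wb") as f:
        f.write(struct.pack("<Q", len(header_bytes)) + header_bytes + a + b)

def load_tensor(path, name):
    """Map the file and decode only the bytes that belong to one tensor."""
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        (header_size,) = struct.unpack_from("<Q", mm, 0)
        header = json.loads(mm[8:8 + header_size].decode("utf-8"))
        begin, end = header[name]["data_offsets"]
        data_start = 8 + header_size          # offsets are relative to the end of the header
        count = (end - begin) // 4            # 4 bytes per F32 element
        values = struct.unpack_from(f"<{count}f", mm, data_start + begin)
    return header[name]["shape"], values

with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.safetensors")
    write_demo_file(demo_path)
    shape_b, values_b = load_tensor(demo_path, "b")
```

Only the pages holding the header and tensor `b` need to be read from disk; tensor `a` is never touched, which is what makes per-tensor and per-shard loading cheap.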

Where ONNX still wins:

Vision models in production: classification, detection, segmentation models trained in PyTorch are routinely exported to ONNX and run via ONNX Runtime in production. Stable, mature, fast.
Edge devices: ONNX Runtime has lightweight backends for many embedded targets (ARM CPUs, mobile NPUs, etc.). Standard PyTorch can’t easily target these.
Conversion to specialised runtimes: TensorRT (NVIDIA optimisation), OpenVINO (Intel), CoreML (Apple) all accept ONNX as input. Common deployment pattern: train PyTorch → export ONNX → convert to runtime-specific format.
Microsoft ML ecosystem: ONNX Runtime is heavily used in Azure / .NET / Microsoft stack.

What makes LLMs hard for ONNX:

Dynamic shapes: LLM inputs are variable-length sequences. ONNX supports dynamic shapes but with overhead; many ONNX runtimes optimise better for static shapes. LLM serving uses variable batch sizes (continuous batching), making static shapes impossible.
KV cache management: the KV cache grows as tokens are generated; managing this efficiently requires runtime support for “stateful” inference. ONNX graphs are stateless, so the cache has to be passed in and out as explicit graph inputs and outputs on every step, or handled by runtime-specific extensions.
Sampling and search: generation involves sampling logits, then re-running with the new token. ONNX models traditionally do single-shot inference; loop-with-state is awkward.
Modern ops: FlashAttention, RoPE, RMSNorm, SwiGLU all need ONNX equivalents. The opset has been catching up but lags PyTorch.
Quantization formats: LLM-specific quantization (q4_K, IQ-quants) has no ONNX equivalent. ONNX has int8 / int4 quantization but with different semantics.
Performance: for LLM inference, specialised engines (vLLM, TensorRT-LLM, llama.cpp) outperform ONNX Runtime by 2-5×. ONNX Runtime hasn’t kept up with LLM-specific optimisations.
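The KV-cache and sampling items in the list above both come down to a loop that carries state between calls. A stateless graph can still express this if the state is handed in and returned on every step, with the loop living outside the graph. The toy below only illustrates that calling pattern: `decode_step` is a made-up stand-in in which the "cache" is a list of token ids and the "model" is a trivial arithmetic rule, not attention.

```python
def decode_step(token, past):
    """Stateless step: all state arrives as `past` and leaves as `present`."""
    present = past + [token]               # stand-in for appending this token's K/V
    next_token = (sum(present) + 1) % 10   # stand-in for attention + sampling
    return next_token, present

def generate(prompt, n_new):
    """Driver loop that lives outside the graph and threads the cache through."""
    past, token = [], None
    for t in prompt:                       # prefill: feed the prompt token by token
        token, past = decode_step(t, past)
    generated = []
    for _ in range(n_new):                 # decode: feed each new token back in
        generated.append(token)
        token, past = decode_step(token, past)
    return generated, len(past)

tokens, cache_len = generate([1, 2], 3)
```

Every call copies or re-passes the growing cache across the graph boundary, which is the overhead that specialised engines avoid by managing the cache inside the runtime.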

The practical state:

For LLMs, ONNX is largely bypassed. The pipeline is usually: PyTorch (training) → safetensors (storage) → vLLM / TensorRT-LLM / llama.cpp (inference). ONNX as an intermediate step adds friction without benefit.

For vision and traditional ML: ONNX remains the default deployment format. The CNN ecosystem hasn’t moved away.

For edge deployment: ONNX has competition from CoreML (Apple), TFLite (Google), and MediaPipe (mobile). ONNX is still relevant but not dominant.

The ONNX story: a successful interchange format for the 2018-era ML world, partially displaced by LLM-specific tooling but still valuable for cross-platform deployment of “classical” models.

END OF CH.22 — Runtimes and frameworks.
§1 (PyTorch dispatch stack: eager vs compiled, torch.compile speedups via fusion) · §2 (CUDA kernel structure, Triton DSL, MLX for Apple Silicon) · §3 (model formats: safetensors as default, GGUF for distribution, ONNX for cross-platform).

Next: Ch.23 — Inference at scale. KV cache, PagedAttention, continuous batching, vLLM internals.