Product quantization + binary codes
HNSW gives you fast TRAVERSAL but doesn’t reduce per-vector storage or distance-computation cost. At 1B vectors × 1024 dim × fp16: that’s 2 TB of memory just for the embeddings. Untenable on any single machine. Product quantization (PQ) (Jegou 2011) compresses each vector into a small 8-32 byte code, reducing memory by 50-100×. Binary embeddings push further: 1 bit per dimension, distance via popcount. Combined with HNSW or IVF (inverted file with coarse clustering), these form the production billion-scale ANN recipe used in FAISS, Milvus, and every major vector database.
Product Quantization (PQ)
The idea: split each d-dim vector into M subvectors of dim d/M; quantize each subvector to a learned codebook entry.
PQ is the workhorse of large-scale ANN. It compresses 50-200× while preserving most of the nearest-neighbor information.
Binary embeddings — 1 bit per dimension
The extreme version: 1 bit per element. Each vector becomes a d-bit binary string.
Budget:
100M vectors → 8 GB → 80 bytes per vector. (Going from 400B/vector down to 80B/vector — 5× compression.)
Wait, that’s modest compression. Let me recheck: fp32 = 4 bytes/dim × 1024 dim = 4096 bytes per vector. To fit in 80 bytes per vector: 51× compression.
PQ parameters:
If we use K = 256 codewords per subvector (1 byte per subvector index):
Storage per vector = M bytes (one byte per subvector).
To fit 80 bytes: M ≤ 80. Let’s use M = 64.
Subvector dim = 1024 / 64 = 16 dimensions per subvector.
Codebooks: M × K × d/M = 64 × 256 × 16 = 256K floats × 4 bytes = 1 MB total. Negligible.
Expected accuracy:
With M = 64, K = 256: each 16-dim subvector is quantized to one of 256 representative values. Quantization error per subvector ≈ 16 × σ² / 256 ≈ σ²/16 where σ is the typical subvector value scale.
Per-vector distance estimation error: ~ M × σ²/16 = 4σ². The total distance is roughly d × σ² = 1024σ². So distance error is ~0.4% of typical distance. Very good.
Empirically: PQ at M = 16-32 typically achieves 85-95% recall@10 on standard benchmarks; M = 64 gives 95%+.
Storage:
Per vector: 64 bytes.
Total for 100M: 6.4 GB. Fits in 8 GB budget with room for index structure.
Search cost:
Per query: build distance table (M × K = 16K float subtractions + sums = trivial).
Per database vector: 64 table lookups + 64 additions = ~128 cycles ≈ 40 ns on modern CPU.
For 100M lookups: 4 seconds linear scan. Still slow — need to combine with HNSW or IVF for efficient navigation.
Combining with IVF:
IVF partitions the database into ~10K clusters. Each query probes ~16-64 clusters (the closest to query). Reduces search space by 100-600×.
Combined: IVF + PQ = millions of vectors searched in ~10ms. Industry-standard recipe.
The full recipe (FAISS IVF+PQ):
- Train: cluster vectors into N_clusters centroids. Train PQ codebooks.
- Index: assign each vector to its nearest cluster. Store its PQ code in the cluster’s inverted list.
- Query: find query’s top K_probe nearest clusters. Search each cluster’s inverted list with PQ distance estimation. Return top-K overall.
Tunable: N_clusters (typically √N, e.g., 10K for 100M); K_probe (typically 16-128); M, K for PQ.
The lookup-table trick
The PQ distance computation is critical to understand:
Binary embeddings revisited
Modern advance: quantization-aware embedding training aligns the embedding training to be MORE TOLERANT of quantization. The Cohere “binary” and “int8” embedding offerings train with quantization in mind, achieving 80-90% recall at 1 bit/dim compression.
Per-pair Hamming distance:
For two 1024-bit binary vectors a and b:
1. XOR: a ⊕ b. Bits that differ become 1.
2. Popcount: count the 1 bits in the result.
On modern CPUs with POPCNT instruction:
- 1024 bits = 16 × 64-bit words.
- 16 × XOR (1 cycle each).
- 16 × POPCNT (3 cycles each on Intel).
- 15 × ADD to sum the popcounts.
- Total: ~80 cycles ≈ 25-30 ns per pair.
Why faster than dot product:
fp32 dot product for 1024 dims: 1024 multiplies + 1023 adds = ~2000 FLOPs. Modern CPU: ~500-1000 ns due to memory bandwidth (must load 4 KB of vector data).
Binary Hamming: only 128 bytes of vector data; processed at hardware-popcount speed. ~25 ns.
Speedup: 20-40× for in-cache data.
GPU equivalent:
NVIDIA H100 has dedicated POPC instructions and SIMD popcount:
- Each thread can process 32 bits per cycle with __popc.
- For 1024-bit vector: 32 threads × 32 bits per cycle = 32 cycles.
- At 1.5 GHz: ~21 ns per pair per thread.
- With 132 SMs × 64 warps × 32 threads in parallel: billions of distance pairs per second per H100.
Specialised libraries (FAISS GPU, cuVS) use this for high-throughput binary search.
When you’d use binary:
- Massive scale (1B+ vectors).
- CPU-only deployment (binary is much friendlier to commodity hardware than fp16 PQ).
- First-pass filtering: find candidates with binary, re-rank with exact distance.
- Storage-constrained systems (mobile, edge).
FAISS IVF+PQ — the production recipe
FAISS IVF+PQ is the canonical production recipe, combining IVF (inverted file) with PQ:
Setup:
10B vectors, 1024 dim. fp32 raw = 40 TB. Doesn’t fit on one machine.
Targets: 1K QPS, sub-50ms latency, 90%+ recall.
Compression:
PQ with M = 32 codes per vector, K = 256 codewords. Each vector → 32 bytes.
10B × 32 bytes = 320 GB. Fits on a single high-memory CPU server (or 4 H100s).
Expected recall@10 with PQ32: ~92-95%.
IVF clustering:
N_clusters = √N ≈ 100K clusters.
Each cluster has ~100K vectors on average.
K_probe = 64 (probe 64 nearest clusters per query). Searches ~6.4M vectors per query (down from 10B — 1500× reduction).
Per-query work: scan 6.4M PQ codes = 6.4M × 32 byte lookups = ~200 MB of memory accessed. At 100 GB/s memory bandwidth: 2 ms.
Plus cluster centroid search: K_probe nearest among 100K centroids = brute-force with PQ-compressed centroids ~ 1 ms.
Total per query: ~3-5 ms.
Hardware:
- 1 high-memory server (1 TB RAM) holds the full PQ index.
- Or 2-4 GPUs with FAISS GPU backend (faster per-query but smaller capacity per GPU).
- 1K QPS throughput: ~1 server can handle if per-query ~3 ms (~300 QPS per core × 8 cores).
Replicate for redundancy + horizontal scaling: 3-5 servers serving 1K QPS each.
Refinement (optional):
For top-K candidates from PQ search, re-rank with EXACT distance on uncompressed vectors. This recovers some recall lost to PQ.
If we keep the full 1024-dim vectors on a separate storage tier (NVMe or even compressed), we can do re-ranking on the top 100 candidates per query.
Re-ranking cost: 100 × 1024-dim dot products = trivial. Adds 1 ms per query.
Final recall@10: 95-97% with re-ranking.
What this costs:
- Storage: 320 GB PQ index + ~40 TB optional full embeddings (cheap, on NVMe).
- Compute: 3-5 servers × $5K/year cloud cost = ~$25K/year.
- Throughput: 1K QPS sustainable. 10K QPS achievable with more replicas.
For perspective: this would have been impossible in 2015 (memory and compute). PQ + IVF + HNSW have democratised billion-scale ANN.
Next: §26.3 — RaBitQ and TurboQuant. The very recent rotation-based quantization techniques that brought us back to the conversation that started this book.