Three fine-tuning approaches — full, feature, PEFT
You have a 70B-parameter base model and a small dataset that the base doesn’t handle well — maybe a domain (legal, medical), a style (matched to your brand), or a task (specific classification structure). How do you adapt the model? Three families have emerged, each defining a different trade-off between adaptation power, memory cost, and deployment complexity. Full fine-tuning updates every parameter and gives maximum expressivity but costs as much memory as the original training. Feature extraction freezes the backbone and trains a new head — cheap but limited. Parameter-efficient fine-tuning (PEFT) sits in the middle: freeze most of the model, train tiny adapters. Of these, PEFT (specifically LoRA + QLoRA from Ch.24) now dominates production fine-tuning. This section maps the trade-offs and explains why.
Full fine-tuning
The most direct approach: continue training every parameter on the new dataset.
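Full fine-tuning has maximum expressivity: every parameter can change to fit the new task. But the memory and compute cost is on the same order as the original pretraining. For a 70B model, the parameter state alone (weights, gradients, and Adam moments) comes to roughly 1.1 TB, far more than any single GPU holds.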
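A back-of-envelope sketch of that estimate for a 70B model. It assumes mixed-precision Adam at 16 bytes per parameter; the recipe and the function name are illustrative assumptions, not something the text specifies, and activation memory is ignored.

```python
def full_ft_state_bytes(n_params: float) -> dict:
    """Per-parameter training state under an assumed mixed-precision Adam recipe."""
    bytes_per_param = {
        "fp16 weights": 2,
        "fp16 gradients": 2,
        "fp32 master weights": 4,
        "fp32 Adam first moment": 4,
        "fp32 Adam second moment": 4,
    }
    return {name: n_params * b for name, b in bytes_per_param.items()}


state = full_ft_state_bytes(70e9)
total_gb = sum(state.values()) / 1e9

for name, n_bytes in state.items():
    print(f"{name:26s} {n_bytes / 1e9:8.0f} GB")
print(f"{'total':26s} {total_gb:8.0f} GB")
```

The total is the parameter state only; activations and framework overhead come on top.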
Full FT is rarely used today outside of frontier labs. The cost is too high for most use cases, and PEFT methods come within a few percent of full-FT quality on most benchmarks at roughly 1% of the cost.
Feature extraction (linear probing)
The other extreme: freeze the entire backbone; train only a new “head” on top of frozen representations.
Feature extraction dominated the pre-2020 transfer learning era (think: BERT + linear classifier). It’s still the default for sentence embedding models and for any task where the pretrained representations are already good and you just need a final decision layer.
The limit: the backbone can’t adapt to the new task. If the pretrained representations don’t expose the right features, no amount of head training will fix it. For tasks far from the pretraining distribution, feature extraction underperforms badly.
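A minimal sketch of linear probing, assuming NumPy and a toy stand-in for the backbone (a fixed random projection followed by tanh). The data, sizes, and learning rate are all hypothetical; a real setup would use a pretrained transformer with its weights frozen. The point is only that gradients touch the head and nothing else.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_model, n = 20, 64, 400

# Stand-in "backbone" (hypothetical): fixed random projection + tanh.
W_backbone = rng.normal(size=(d_in, d_model)) / np.sqrt(d_in)
W_frozen_copy = W_backbone.copy()  # kept only to verify nothing changed

# Toy binary task.
x = rng.normal(size=(n, d_in))
y = (x[:, 0] + x[:, 1] > 0).astype(float)

# The backbone is frozen, so features can be computed once and cached.
h = np.tanh(x @ W_backbone)

# Trainable head: d_model weights plus one bias.
w = np.zeros(d_model)
b = 0.0


def loss_and_grads(w, b):
    p = 1.0 / (1.0 + np.exp(-(h @ w + b)))
    loss = -np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))
    g = (p - y) / n
    return loss, h.T @ g, g.sum()


loss_before, _, _ = loss_and_grads(w, b)
for _ in range(2000):
    _, grad_w, grad_b = loss_and_grads(w, b)
    w -= 0.05 * grad_w  # only the head is updated
    b -= 0.05 * grad_b
loss_after, _, _ = loss_and_grads(w, b)

accuracy = np.mean(((h @ w + b) > 0) == (y > 0.5))
print(f"loss {loss_before:.3f} -> {loss_after:.3f}, train accuracy {accuracy:.2f}")
```

Because the features `h` never change, the head can only recombine what the backbone already exposes, which is the limit described above.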
Full fine-tuning:
- Trainable: ALL backbone parameters + (optional) new head.
- Frozen: nothing.
- Memory: ~4× model size at a single precision (weights + grads + two Adam moments); with mixed-precision Adam, about 16 bytes per parameter.
Feature extraction (linear probing):
- Trainable: a new small head (typically d_model · n_classes parameters).
- Frozen: entire backbone.
- Memory: ~1× model size (just inference, no gradients).
Parameter-efficient fine-tuning (PEFT):
- Trainable: small additional modules (LoRA adapters, prefix tokens, etc.). Typically ≤ 1% of backbone params.
- Frozen: entire backbone.
- Memory: ~1.05× model size (inference + tiny gradients/Adam for the adapters).
The memory hierarchy is roughly: PEFT ≈ Feature ≪ Full. The expressivity hierarchy is: Full > PEFT ≫ Feature. PEFT is the sweet spot — close to full-FT quality at near-feature-extraction cost.
Parameter-efficient fine-tuning (PEFT)
The middle path: freeze the backbone (huge memory savings) but add small trainable modules that can adapt the model’s behaviour without changing the bulk.
PEFT is now the dominant fine-tuning paradigm. The specific methods within PEFT (LoRA, adapters, prefix tuning, IA³, prompt tuning) differ in WHERE they inject the trainable component:
- LoRA / QLoRA: inject low-rank delta matrices into linear layers (most common).
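- Adapters: insert small bottleneck MLPs inside each transformer block, after the attention and feed-forward sublayers.
- Prefix / Prompt tuning: prepend trainable “soft” tokens, either to the input embeddings (prompt tuning) or to the keys and values at every layer (prefix tuning).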
- IA³: scale attention and FFN activations by per-feature multipliers.
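To make the most common entry in the list concrete, here is a minimal NumPy sketch of a LoRA-style linear layer. The sizes, the zero initialisation of `B`, and the `alpha / r` scaling follow the usual LoRA convention and are assumptions of this sketch, not details given in this section.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 512, 512, 8, 16  # illustrative sizes

W = rng.normal(size=(d_out, d_in)) * 0.02  # frozen base weight
A = rng.normal(size=(r, d_in)) * 0.01      # trainable, small random init
B = np.zeros((d_out, r))                   # trainable, zero init


def lora_forward(x, W, A, B, alpha, r):
    """y = x W^T + (alpha / r) * x A^T B^T, i.e. W_eff = W + (alpha / r) * B A."""
    return x @ W.T + (alpha / r) * ((x @ A.T) @ B.T)


x = rng.normal(size=(4, d_in))
y_base = x @ W.T
y_lora = lora_forward(x, W, A, B, alpha, r)

trainable = A.size + B.size      # r * (d_in + d_out)
fraction = trainable / W.size    # share of this layer's parameters

print("identical to base at init:", np.allclose(y_lora, y_base))
print(f"trainable params: {trainable} ({fraction:.2%} of the layer)")
```

With `B` initialised to zero, the adapted layer starts out identical to the base layer. The trainable fraction for a square layer is 2r / d, so it shrinks further as the hidden size grows.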
§19.2 covers LoRA in depth (the math, the intrinsic-dim argument, the kernel). §19.3 surveys the rest of the PEFT family.
(a) Sentiment classification on movie reviews:
Feature extraction wins. Reason: sentiment is a low-level semantic feature already captured well by pretrained BERT-style representations. A linear classifier on top can reach ~92% on IMDb. Full fine-tuning gets to ~94% but at vastly higher cost. The marginal 2% rarely justifies the 1000× compute. This was the canonical case for “freeze and probe” through 2019.
(b) Medical-domain summarization:
Full fine-tuning wins, with a caveat. Reason: medical text is FAR from the pretraining distribution. Pretrained representations don’t cleanly separate medical entities (drug names, conditions, dosages). Feature extraction with a frozen backbone fails. Full FT can adapt the backbone’s mid-level representations to capture medical structure. Even better: domain-adaptive pre-training (continued pretraining on medical text) followed by task fine-tuning. PEFT methods barely existed when this was the standard recipe; once they emerged, they replaced full FT here at roughly 1% of the cost.
(c) Code completion in a new programming language:
Full fine-tuning, but PEFT now matches it. Reason: a new language has new syntax (token-level patterns) that the model needs to internalise. Feature extraction can’t adapt because syntax recognition is happening throughout the layers. Full FT can shift the model’s “what does code look like” prior toward the new language. In practice, modern PEFT (LoRA at rank 32+) closes the gap with full FT and is the default choice.
The general principle (pre-PEFT):
If the task uses features the backbone already represents well → feature extraction.
If the task requires shifting internal representations → full fine-tuning.
The PEFT revolution showed that “shifting internal representations” can usually be done with a small additive correction (LoRA) rather than rewriting every parameter. The Aghajanyan 2020 intrinsic-dim result (§19.2) is what made this case rigorous.
When each approach is right
The PEFT vs full FT decision is now mostly economic. PEFT typically lands within 1-3% of full fine-tuning on standard benchmarks, while costing 100-1000× less. Unless you’re at the frontier and that 1-3% buys real production value, PEFT wins.
Budget analysis:
1 H100 = 80 GB HBM, ~1 PFLOP/s, ~$50K of compute over 2 weeks.
70B model in fp16 = 140 GB. Doesn’t fit. In 4-bit: ~35 GB. Fits with room for activations and adapters.
Full fine-tuning needs ~1.1 TB just for parameter state. Impossible on 1 H100.
Feature extraction: ~35 GB for inference + tiny head. Possible but the head can only adapt the final layer’s representations, which won’t capture deep domain knowledge.
QLoRA: 35 GB base + ~5 GB for LoRA adapters + Adam state. Fits comfortably.
Decision: QLoRA on the 4-bit base.
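The budget analysis above can be checked with a few lines of arithmetic. This sketch reuses the figures from the text (80 GB of HBM, 70B parameters, about 5 GB for adapters and their Adam state) and assumes 16 bytes per parameter for full fine-tuning; activations and the small head are ignored.

```python
GPU_GB = 80.0     # one H100, from the text
N_PARAMS = 70e9


def weights_gb(bytes_per_param: float) -> float:
    return N_PARAMS * bytes_per_param / 1e9


options_gb = {
    "full fine-tuning, parameter state": weights_gb(16),   # assumed 16 B/param
    "fp16 base, weights only": weights_gb(2),
    "feature extraction on 4-bit base": weights_gb(0.5),
    "QLoRA: 4-bit base + adapters + Adam": weights_gb(0.5) + 5.0,
}

fits = {name: need <= GPU_GB for name, need in options_gb.items()}
qlora_headroom_gb = GPU_GB - options_gb["QLoRA: 4-bit base + adapters + Adam"]

for name, need in options_gb.items():
    verdict = "fits" if fits[name] else "does not fit"
    print(f"{name:38s} {need:7.0f} GB  {verdict}")
print(f"QLoRA headroom for activations: {qlora_headroom_gb:.0f} GB")
```

Only the two 4-bit options fit on the card, and of those only QLoRA can adapt the internal representations.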
Workflow:
- Load the 70B base in 4-bit NF4 (35 GB). Use Ch.24 §3 QLoRA setup.
- For each of the 5 customer datasets, train a SEPARATE set of LoRA adapters (typically rank 16-64). Each set is ~0.5-1 GB.
- Each customer’s training run: ~2 days on the H100 for a 10K-example dataset. 5 customers × 2 days = 10 days. Fits in 2-week budget.
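A rough sketch of where the adapter sizes and the schedule in this workflow come from. The layer shapes below are hypothetical (Llama-2-70B-like) and are not taken from the text; the real size depends on which layers are targeted and on storage precision.

```python
# Hypothetical 70B-class shapes (Llama-2-70B-like); an assumption of this sketch.
N_LAYERS = 80
HIDDEN = 8192
KV_DIM = 1024   # grouped-query attention
FFN = 28672

LINEAR_SHAPES = {  # (d_in, d_out) for each linear layer in one block
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, FFN),
    "up_proj": (HIDDEN, FFN),
    "down_proj": (FFN, HIDDEN),
}
ATTENTION_ONLY = ["q_proj", "k_proj", "v_proj", "o_proj"]
ALL_LINEAR = list(LINEAR_SHAPES)


def lora_params(rank: int, targets: list) -> int:
    """A LoRA pair on a (d_in, d_out) layer adds rank * (d_in + d_out) parameters."""
    per_layer = sum(rank * sum(LINEAR_SHAPES[t]) for t in targets)
    return N_LAYERS * per_layer


def adapter_gb(rank: int, targets: list, bytes_per_param: int = 2) -> float:
    return lora_params(rank, targets) * bytes_per_param / 1e9


for rank in (16, 32, 64):
    print(f"rank {rank:2d}: attention-only {adapter_gb(rank, ATTENTION_ONLY):.2f} GB, "
          f"all linear {adapter_gb(rank, ALL_LINEAR):.2f} GB (fp16)")

# Schedule from the text: 5 customers, about 2 days each, 14-day budget.
total_days = 5 * 2
print(f"total training time: {total_days} days of 14")
```

Under these assumed shapes, rank 32 on all linear layers gives roughly 0.8 GB in fp16, the same order as the figure quoted above; attention-only adapters come out smaller and rank 64 on every layer larger.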
Deployment strategy:
For multi-tenant serving, the most powerful pattern is shared base + per-customer adapter swap:
- Load the 70B base once in 4-bit on the inference GPU (~35 GB).
- For each request, dynamically load the customer’s LoRA adapter (~0.5-1 GB) — much smaller than the base.
- Apply the adapter as W_eff = dequantize(W_base) + B · A; serve the response.
- Adapter swap is fast (~100 ms cold; ~10 ms cached).
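The steps above can be sketched as a toy single-layer server with a small LRU cache for adapters. Everything here is hypothetical (the `AdapterServer` class, the customer names, the in-memory `store` standing in for disk), and a plain float matrix stands in for the dequantized base weight.

```python
from collections import OrderedDict

import numpy as np


class AdapterServer:
    """Toy model: one shared base weight, many per-customer LoRA adapters."""

    def __init__(self, w_base, load_adapter, cache_size=2):
        self.w_base = w_base              # loaded once, shared by all customers
        self.load_adapter = load_adapter  # callable: customer_id -> (B, A)
        self.cache = OrderedDict()        # LRU cache of hot adapters
        self.cache_size = cache_size
        self.cold_loads = 0

    def _get_adapter(self, customer_id):
        if customer_id in self.cache:
            self.cache.move_to_end(customer_id)  # mark as recently used
            return self.cache[customer_id]
        self.cold_loads += 1
        adapter = self.load_adapter(customer_id)
        self.cache[customer_id] = adapter
        if len(self.cache) > self.cache_size:
            self.cache.popitem(last=False)       # evict least recently used
        return adapter

    def forward(self, customer_id, x):
        B, A = self._get_adapter(customer_id)
        # Same result as x @ (W_base + B @ A).T, without materialising W_eff.
        return x @ self.w_base.T + (x @ A.T) @ B.T


rng = np.random.default_rng(0)
d, r = 64, 4
w_base = rng.normal(size=(d, d)) * 0.1
store = {  # stand-in for adapters saved on disk
    name: (rng.normal(size=(d, r)) * 0.1, rng.normal(size=(r, d)) * 0.1)
    for name in ("acme", "bolt", "cyan")
}

server = AdapterServer(w_base, load_adapter=store.__getitem__, cache_size=2)
x = rng.normal(size=(1, d))
requests = ["acme", "bolt", "acme", "cyan", "bolt"]
outputs = [server.forward(cid, x) for cid in requests]
print("cold loads:", server.cold_loads, "of", len(requests), "requests")
```

Computing the low-rank term separately keeps the base weight untouched, so switching customers costs only the adapter lookup rather than a rewrite of the base.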
Result: 5 customers, 5 customer-specific behaviour profiles, served from a single 80 GB GPU. Without PEFT, this would require either 5 separate fine-tuned models (5 · 35 GB = 175 GB, doesn’t fit) or fine-tuning a single model on all 5 customers (which produces “average” quality across all).
The deeper lesson: PEFT changes the deployment economics of fine-tuning. Per-customer customisation becomes affordable; multi-tenant serving becomes practical. The shared-base + swap-adapter pattern is what powers many “custom model” SaaS offerings today.
Next: §19.2 — LoRA. The full math behind why a rank-8 update matrix captures most of the fine-tuning signal, via Aghajanyan 2020’s intrinsic-dimensionality result.