Three fine-tuning approaches — full, feature, PEFT

Section 19.1

You have a 70B-parameter base model and a small dataset that the base doesn’t handle well — maybe a domain (legal, medical), a style (matched to your brand), or a task (specific classification structure). How do you adapt the model? Three families have emerged, each defining a different trade-off between adaptation power, memory cost, and deployment complexity. Full fine-tuning updates every parameter and gives maximum expressivity but costs as much memory as the original training. Feature extraction freezes the backbone and trains a new head — cheap but limited. Parameter-efficient fine-tuning (PEFT) sits in the middle: freeze most of the model, train tiny adapters. Of these, PEFT (specifically LoRA + QLoRA from Ch.24) now dominates production fine-tuning. This section maps the trade-offs and explains why.

Full fine-tuning

The most direct approach: continue training every parameter on the new dataset.

Full fine-tuning:
  Initialize: θ = θ_pretrained
  For each step:
    Compute loss on new data using all of θ.
    Update all of θ via gradient descent.
  Trainable parameters: ALL of θ.
  Memory:  ~4× model size (weights + grads + Adam m, v).
  Compute: full forward + full backward + Adam step.
  Output:  a fully new model with all parameters updated.
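As a rough illustration of the loop above, here is a minimal Python sketch on a toy linear model. The model, the data, and the names `predict`, `mse` and `full_finetune` are illustrative assumptions, not part of any library. It uses plain gradient descent rather than Adam to stay short. The point is only that training starts from the pretrained values and every parameter is updated.

```python
def predict(theta, x):
    """Toy stand-in for the full model: y = w . x + b."""
    return sum(wi * xi for wi, xi in zip(theta["w"], x)) + theta["b"]

def mse(theta, data):
    """Mean squared error of the model on a list of (x, y) pairs."""
    return sum((predict(theta, x) - y) ** 2 for x, y in data) / len(data)

def full_finetune(theta_pretrained, data, lr=0.05, steps=200):
    """Gradient descent that updates ALL parameters, starting from the pretrained ones."""
    # Initialize: theta = theta_pretrained (copied so the original stays intact).
    theta = {"w": list(theta_pretrained["w"]), "b": theta_pretrained["b"]}
    n = len(data)
    for _ in range(steps):
        grad_w = [0.0] * len(theta["w"])
        grad_b = 0.0
        for x, y in data:
            err = predict(theta, x) - y
            for i, xi in enumerate(x):
                grad_w[i] += 2 * err * xi / n
            grad_b += 2 * err / n
        # Nothing is frozen: every weight and the bias take a step.
        theta["w"] = [wi - lr * gi for wi, gi in zip(theta["w"], grad_w)]
        theta["b"] -= lr * grad_b
    return theta

# Hypothetical "pretrained" parameters and a small new dataset (y = 2*x1 + x2).
theta_pretrained = {"w": [1.0, -1.0], "b": 0.0}
new_data = [([1.0, 0.0], 2.0), ([0.0, 1.0], 1.0), ([1.0, 1.0], 3.0), ([0.5, 0.5], 1.5)]
theta_new = full_finetune(theta_pretrained, new_data)
```

In a real setting the gradient and optimizer buffers exist for every one of the model's parameters, which is where the memory cost comes from.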

Full fine-tuning has maximum expressivity: every parameter can change to fit the new task. But each training step needs the same memory and compute as a pretraining step, because every parameter carries gradients and optimizer state. For a 70B model:

Full fine-tuning of 70B:
  weights:    70B · 4 bytes = 280 GB   (fp32 master)
  gradients:  70B · 4 bytes = 280 GB
  Adam m, v:  70B · 8 bytes = 560 GB
  Total:                      1120 GB
  Requires ~14 H100s (80 GB each) just for parameter state. Plus activations.
  Cost: ~$50-100K of compute for a typical run.
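The arithmetic above can be checked with a few lines of Python. This is a back-of-the-envelope sketch that assumes 1 GB = 10^9 bytes and 80 GB per H100, and it ignores activations. The function name `full_ft_state_gb` is made up for illustration.

```python
import math

def full_ft_state_gb(n_params, bytes_weights=4, bytes_grads=4, bytes_adam=8):
    """Parameter-state memory in GB: fp32 weights + fp32 gradients + Adam m and v."""
    per_param = bytes_weights + bytes_grads + bytes_adam  # 16 bytes per parameter
    return n_params * per_param / 1e9

N_PARAMS = 70_000_000_000   # 70B parameters
H100_GB = 80                # memory per H100

total_gb = full_ft_state_gb(N_PARAMS)
gpus_needed = math.ceil(total_gb / H100_GB)
print(f"{total_gb:.0f} GB of parameter state -> at least {gpus_needed} H100s")
```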

Full FT is rarely used today outside of frontier labs. The cost is too high for most use cases, and PEFT methods come within a few percent of full-FT quality on most benchmarks at roughly 1% of the cost.

Feature extraction (linear probing)

The other extreme: freeze the entire backbone; train only a new “head” on top of frozen representations.

Feature extraction:
  Initialize: θ = θ_pretrained, h = new_head
  For each step:
    Run forward through frozen θ to get representations h_in.
    Train h (typically a small linear layer) on h_in.
  Trainable parameters: just the new head, typically d_model · n_classes.
  Memory:  ~1× model size for inference + tiny extras for the head.
  Compute: forward pass (no backward through backbone).
  Output:  same backbone + new head.

Feature extraction is a fine-tuning method that freezes the pretrained backbone and trains only a new task-specific head (typically a small linear layer) on top of the backbone's representations. It is extremely cheap (no backbone gradients) but has limited expressivity: the model can't adapt its internal representations to the new task. It is used for sentence embeddings, for classification heads on top of frozen LLMs, and as a strong baseline for any new task.

Feature extraction dominated the pre-2020 transfer learning era (think: BERT + linear classifier). It's still the default for sentence embedding models and for any task where the pretrained representations are already good and you just need a final decision layer.

The limit: the backbone can’t adapt to the new task. If the pretrained representations don’t expose the right features, no amount of head training will fix it. For tasks far from the pretraining distribution, feature extraction underperforms badly.
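To make the "freeze and probe" recipe concrete, here is a minimal sketch in plain Python. The two-unit `backbone`, its weights `BACKBONE_W`, and the six-example dataset are toy assumptions standing in for a real pretrained model and task. Only the logistic-regression head is trained, and the frozen features are computed once and reused.

```python
import math

# Frozen "backbone": fixed weights that are never updated (toy stand-in for θ_pretrained).
BACKBONE_W = [[1.0, -1.0], [0.5, 0.5]]

def backbone(x):
    """Forward pass through the frozen backbone; returns the representation h_in."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in BACKBONE_W]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_loss(w, b, feats, labels):
    """Mean binary cross-entropy of the linear head on precomputed features."""
    total = 0.0
    for h, y in zip(feats, labels):
        p = sigmoid(sum(wi * hi for wi, hi in zip(w, h)) + b)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(feats)

def train_head(feats, labels, lr=0.5, steps=300):
    """Gradient descent on the head only; no gradient ever reaches the backbone."""
    w, b, n = [0.0] * len(feats[0]), 0.0, len(feats)
    for _ in range(steps):
        gw, gb = [0.0] * len(w), 0.0
        for h, y in zip(feats, labels):
            p = sigmoid(sum(wi * hi for wi, hi in zip(w, h)) + b)
            for i, hi in enumerate(h):
                gw[i] += (p - y) * hi / n
            gb += (p - y) / n
        w = [wi - lr * gi for wi, gi in zip(w, gw)]
        b -= lr * gb
    return w, b

# Toy task: label is 1 when x1 > x2. The frozen features already expose this.
inputs = [[2.0, 0.0], [0.0, 2.0], [1.0, -1.0], [-1.0, 1.0], [3.0, 1.0], [1.0, 3.0]]
labels = [1, 0, 1, 0, 1, 0]
feats = [backbone(x) for x in inputs]   # computed once, since the backbone is frozen
head_w, head_b = train_head(feats, labels)
preds = [int(sigmoid(sum(wi * hi for wi, hi in zip(head_w, h)) + head_b) >= 0.5) for h in feats]
accuracy = sum(p == y for p, y in zip(preds, labels)) / len(labels)
```

The toy task works because the first frozen feature already separates the classes. If no frozen feature did, the head would have nothing to work with, which is the limit described above.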

— think, then check —

Full fine-tuning:

Trainable: ALL backbone parameters + (optional) new head.
Frozen: nothing.
Memory: ~4× model size (weights + grads + Adam state).

Feature extraction (linear probing):

Trainable: a new small head (typically d_model · n_classes parameters).
Frozen: entire backbone.
Memory: ~1× model size (just inference, no gradients).

Parameter-efficient fine-tuning (PEFT):

Trainable: small additional modules (LoRA adapters, prefix tokens, etc.). Typically ≤ 1% of backbone params.
Frozen: entire backbone.
Memory: ~1.05× model size (inference + tiny gradients/Adam for the adapters).

The memory hierarchy is roughly: PEFT ≈ Feature ≪ Full. The expressivity hierarchy is: Full > PEFT ≫ Feature. PEFT is the sweet spot — close to full-FT quality at near-feature-extraction cost.
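A quick count of trainable parameters shows how far apart the three approaches sit. The configuration below is an assumption chosen for illustration: hidden size 8192, 3 classes, 80 layers, and rank-16 low-rank adapters on four square d_model × d_model projections per layer. Real models and adapter placements differ.

```python
BACKBONE_PARAMS = 70_000_000_000  # full fine-tuning trains all of these
D_MODEL = 8192                    # assumed hidden size
N_CLASSES = 3                     # assumed number of output classes
N_LAYERS = 80                     # assumed number of transformer layers
MATRICES_PER_LAYER = 4            # assumed: four square projections per layer get adapters
RANK = 16                         # assumed adapter rank

# Feature extraction: a single linear head of shape d_model x n_classes.
head_params = D_MODEL * N_CLASSES

# Low-rank adapters: each adapted d_in x d_out matrix adds rank * (d_in + d_out) parameters.
lora_params = N_LAYERS * MATRICES_PER_LAYER * RANK * (D_MODEL + D_MODEL)

for name, count in [("full FT", BACKBONE_PARAMS), ("PEFT (low-rank)", lora_params), ("head only", head_params)]:
    print(f"{name:>16}: {count:>14,} trainable ({100 * count / BACKBONE_PARAMS:.5f}% of backbone)")
```

Under these assumptions the adapters come to roughly 0.12% of the backbone, inside the "≤ 1%" range quoted above, and the head is several orders of magnitude smaller still.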

↳ §19.1 + transfer learning history

Parameter-efficient fine-tuning (PEFT)

The middle path: freeze the backbone (huge memory savings) but add small trainable modules that can adapt the model’s behaviour without changing the bulk.

PEFT (general formulation):
  Backbone weights θ_base : FROZEN.
  Adapter parameters φ    : TRAINABLE. Typically |φ| ≪ |θ_base|.
  Modified forward pass: f(x; θ_base, φ), where the adapter modules are inserted into the original network.
  For each step:
    Compute loss using f(x; θ_base, φ).
    Update ONLY φ via gradient descent. θ_base does not change.
  Trainable params: ~0.1-1% of |θ_base|.
  Memory:  ~|θ_base| · 1 byte (inference precision) + |φ| · ~12 bytes (Adam state).
  Compute: forward through full network; backward still propagates activation gradients through the frozen layers, but weight gradients are computed only for φ.
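The general formulation can be sketched with one frozen linear layer plus a low-rank additive adapter. This is the LoRA-style choice of φ, picked here as one example of the family. The matrices, the squared-error loss, and the helper names `matvec`, `forward`, `loss` and `adapter_step` are illustrative assumptions. B starts at zero, so the adapted layer initially reproduces the frozen one.

```python
def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

W_BASE = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # θ_base: frozen, d_out=2, d_in=3
A0 = [[0.1, 0.2, 0.3]]                       # φ, part 1: r x d_in with r=1, trainable
B0 = [[0.0], [0.0]]                          # φ, part 2: d_out x r, trainable, zero init

def forward(x, A, B):
    """f(x; θ_base, φ) = W_base x + B (A x)."""
    base = matvec(W_BASE, x)
    delta = matvec(B, matvec(A, x))
    return [b + d for b, d in zip(base, delta)]

def loss(x, target, A, B):
    return 0.5 * sum((y - t) ** 2 for y, t in zip(forward(x, A, B), target))

def adapter_step(x, target, A, B, lr=0.1):
    """One gradient step on the squared error that updates ONLY A and B."""
    z = matvec(A, x)                                       # low-rank activation, length r
    err = [y - t for y, t in zip(forward(x, A, B), target)]  # dL/dy
    grad_B = [[e * zk for zk in z] for e in err]           # dL/dB, shape d_out x r
    grad_z = [sum(err[i] * B[i][k] for i in range(len(B))) for k in range(len(z))]
    grad_A = [[gz * xj for xj in x] for gz in grad_z]      # dL/dA, shape r x d_in
    new_A = [[a - lr * g for a, g in zip(ra, rg)] for ra, rg in zip(A, grad_A)]
    new_B = [[b - lr * g for b, g in zip(rb, rg)] for rb, rg in zip(B, grad_B)]
    return new_A, new_B                                    # W_BASE is never touched

x, target = [1.0, 2.0, 3.0], [2.0, 1.0]
A, B = A0, B0
for _ in range(5):
    A, B = adapter_step(x, target, A, B)
```

No gradient buffer or optimizer state is ever allocated for `W_BASE`, and that is where the memory saving comes from.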

PEFT is a class of fine-tuning methods that freeze the pretrained backbone and train only a small set of additional parameters (adapters, prefix vectors, LoRA matrices, etc.). Trainable parameter count is typically 0.01-1% of the backbone. Memory cost is dramatically lower than full fine-tuning because the optimizer state, gradients, and parameter copies only need to track the small adapters.

PEFT has been the dominant fine-tuning paradigm since 2021, and LoRA (Hu 2021) is the de facto default within it. The specific methods within PEFT (LoRA, adapters, prefix tuning, IA³, prompt tuning) differ in where they inject the trainable component:

LoRA / QLoRA: inject low-rank delta matrices into linear layers (most common).
Adapters: insert small bottleneck MLPs inside each transformer block, after the attention and feed-forward sublayers.
Prefix / Prompt tuning: prepend trainable “soft” tokens to the input.
IA³: scale attention and FFN activations by per-feature multipliers.

§19.2 covers LoRA in depth (the math, the intrinsic-dim argument, the kernel). §19.3 surveys the rest of the PEFT family.

— think, then check —

(a) Sentiment classification on movie reviews:

Feature extraction wins. Reason: sentiment is a low-level semantic feature already captured well by pretrained BERT-style representations. A linear classifier on top can reach ~92% on IMDb. Full fine-tuning gets to ~94% but at vastly higher cost. The marginal 2% rarely justifies the 1000× compute. This was the canonical case for “freeze and probe” through 2019.

(b) Medical-domain summarization:

Full fine-tuning wins, with a caveat. Reason: medical text is FAR from the pretraining distribution. Pretrained representations don’t cleanly separate medical entities (drug names, conditions, dosages). Feature extraction with a frozen backbone fails. Full FT can adapt the backbone’s mid-level representations to capture medical structure. Even better: domain-adaptive pre-training (continued pretraining on medical text) followed by task fine-tuning. PEFT methods barely existed at the time; once they emerged, they replaced full FT here at 1% the cost.

(c) Code completion in a new programming language:

Full fine-tuning, but PEFT now matches it. Reason: a new language has new syntax (token-level patterns) that the model needs to internalise. Feature extraction can’t adapt because syntax recognition is happening throughout the layers. Full FT can shift the model’s “what does code look like” prior toward the new language. In practice, modern PEFT (LoRA at rank 32+) closes the gap with full FT and is the default choice.

The general principle (pre-PEFT):

If the task uses features the backbone already represents well → feature extraction.

If the task requires shifting internal representations → full fine-tuning.

The PEFT revolution showed that “shifting internal representations” can usually be done with a small additive correction (LoRA) rather than rewriting every parameter. The Aghajanyan 2020 intrinsic-dim result (§19.2) is what made this case rigorous.

↳ §19.1 + pre-PEFT era

When each approach is right

Choosing the approach (2024 rough rules):

  Task profile                                   Approach
  ─────────────────────────────────────────────────────────────────────────
  Need maximum quality, have $$$                 Full FT
  Want to fine-tune a 70B+ model                 LoRA / QLoRA
  Want to fine-tune on a single GPU              QLoRA
  Have multiple tasks to serve                   LoRA (per-task adapters)
  Embedding model / sentence representations     Feature extraction
  Classification head on top of frozen LLM       Feature extraction
  In-context learning is enough                  No fine-tuning at all
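The rough rules can be written as a small lookup function. This is only a sketch: the function name, the flag names, and especially the order in which overlapping conditions are checked are assumptions added for illustration, since the rules themselves do not define a precedence.

```python
def choose_approach(icl_sufficient=False, frozen_head_or_embeddings=False,
                    max_quality_with_budget=False, single_gpu=False,
                    multi_task=False, model_params_b=7):
    """Map a task profile to an approach, following the rough rules above.

    The precedence between overlapping conditions is an assumption of this sketch.
    """
    if icl_sufficient:
        return "No fine-tuning at all"
    if frozen_head_or_embeddings:      # embedding model or classification head on a frozen LLM
        return "Feature extraction"
    if max_quality_with_budget:        # "need maximum quality, have $$$"
        return "Full FT"
    if single_gpu:
        return "QLoRA"
    if multi_task:
        return "LoRA (per-task adapters)"
    # 70B+ models, and anything else that still needs fine-tuning, default to PEFT.
    return "LoRA / QLoRA"

print(choose_approach(single_gpu=True, model_params_b=70))   # QLoRA
print(choose_approach(multi_task=True))                      # LoRA (per-task adapters)
```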

The PEFT vs full FT decision is now mostly economic. PEFT loss vs full-FT loss is typically within 1-3% on standard benchmarks; the cost difference is 100-1000×. Unless you’re at the frontier and that 1-3% buys real production value, PEFT wins.

— think, then check —

Budget analysis:

1 H100 = 80 GB HBM, ~1 PFLOP/s, ~$50K of compute over 2 weeks.

70B model in fp16 = 140 GB. Doesn’t fit. In 4-bit: ~35 GB. Fits with room for activations and adapters.

Full fine-tuning needs ~1.1 TB just for parameter state. Impossible on 1 H100.

Feature extraction: ~35 GB for inference + tiny head. Possible but the head can only adapt the final layer’s representations, which won’t capture deep domain knowledge.

QLoRA: 35 GB base + ~5 GB for LoRA adapters + Adam state. Fits comfortably.

Decision: QLoRA on the 4-bit base.
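The memory side of this budget analysis is easy to reproduce. The sketch below assumes 1 GB = 10^9 bytes and takes the ~5 GB allowance for adapters plus Adam state from the estimate above. It ignores activations and quantization overhead, so treat the result as a lower bound.

```python
H100_GB = 80
N_PARAMS = 70_000_000_000

def weights_gb(n_params, bits_per_param):
    """Memory for the weights alone at a given precision."""
    return n_params * bits_per_param / 8 / 1e9

fp16_gb = weights_gb(N_PARAMS, 16)      # 140 GB: does not fit on one H100
nf4_gb = weights_gb(N_PARAMS, 4)        # 35 GB: fits
full_ft_gb = N_PARAMS * 16 / 1e9        # fp32 weights + grads + Adam m, v
qlora_gb = nf4_gb + 5                   # 4-bit base + ~5 GB adapters and Adam state

for name, gb in [("fp16 weights", fp16_gb), ("4-bit weights", nf4_gb),
                 ("full FT state", full_ft_gb), ("QLoRA", qlora_gb)]:
    verdict = "fits" if gb <= H100_GB else "does not fit"
    print(f"{name:>14}: {gb:7.0f} GB -> {verdict} on one H100")
```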

Workflow:

Load the 70B base in 4-bit NF4 (35 GB). Use Ch.24 §3 QLoRA setup.
For each of the 5 customer datasets, train a SEPARATE set of LoRA adapters (typically rank 16-64). Each set is ~0.5-1 GB.
Each customer’s training run: ~2 days on the H100 for a 10K-example dataset. 5 customers × 2 days = 10 days. Fits in 2-week budget.

Deployment strategy:

For multi-tenant serving, the most powerful pattern is shared base + per-customer adapter swap:

Load the 70B base once in 4-bit on the inference GPU (~35 GB).
For each request, dynamically load the customer’s LoRA adapter (~0.5-1 GB) — much smaller than the base.
Apply the adapter as W_eff = dequantize(W_base) + B · A; serve the response.
Adapter swap is fast (~100 ms cold; ~10 ms cached).
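A toy version of the shared-base + adapter-swap pattern might look like the following. The class name `MultiTenantServer`, the customer ids, and the 2×2 matrices are all hypothetical. The base is stored already dequantized, and the adapter "store" is a plain dict standing in for disk or object storage. The sketch computes W_base·x + B·(A·x), which equals W_eff·x without ever materializing W_eff.

```python
def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

class MultiTenantServer:
    """One shared base weight matrix, plus per-customer low-rank adapters swapped per request."""

    def __init__(self, w_base, adapter_store):
        self.w_base = w_base                 # loaded once, shared by every customer
        self.adapter_store = adapter_store   # customer_id -> (B, A); stand-in for slow storage
        self.cache = {}                      # adapters already loaded onto the "GPU"

    def _get_adapter(self, customer_id):
        if customer_id not in self.cache:    # cold path: fetch the adapter from the store
            self.cache[customer_id] = self.adapter_store[customer_id]
        return self.cache[customer_id]       # warm path: reuse the cached adapter

    def serve(self, customer_id, x):
        B, A = self._get_adapter(customer_id)
        base_out = matvec(self.w_base, x)    # same computation for all customers
        delta = matvec(B, matvec(A, x))      # customer-specific low-rank correction
        return [b + d for b, d in zip(base_out, delta)]

# Hypothetical adapters for two customers (rank 1).
store = {
    "acme":   ([[1.0], [0.0]], [[0.0, 1.0]]),
    "globex": ([[0.0], [1.0]], [[1.0, 0.0]]),
}
server = MultiTenantServer(w_base=[[1.0, 0.0], [0.0, 1.0]], adapter_store=store)
out_acme = server.serve("acme", [1.0, 2.0])
out_globex = server.serve("globex", [1.0, 2.0])
```

The same input gives a different output per customer while `w_base` stays untouched. Keeping the low-rank path separate also avoids rebuilding a full-size weight matrix on every swap.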

Result: 5 customers, 5 customer-specific behaviour profiles, served from a single 80 GB GPU. Without PEFT, this would require either 5 separate fine-tuned models (5 · 35 GB = 175 GB, doesn’t fit) or fine-tuning a single model on all 5 customers (which produces “average” quality across all).
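The comparison in the previous paragraph comes down to a few multiplications. The sketch takes the upper end of the ~0.5-1 GB adapter size quoted above and ignores activation and KV-cache memory.

```python
GPU_GB = 80
BASE_4BIT_GB = 35
ADAPTER_GB = 1.0     # upper end of the ~0.5-1 GB range quoted above
N_CUSTOMERS = 5

# Option 1: one fully fine-tuned 4-bit model per customer.
separate_models_gb = N_CUSTOMERS * BASE_4BIT_GB

# Option 2: one shared 4-bit base plus one adapter per customer.
shared_base_gb = BASE_4BIT_GB + N_CUSTOMERS * ADAPTER_GB

print(f"separate models: {separate_models_gb} GB (fits: {separate_models_gb <= GPU_GB})")
print(f"shared base:     {shared_base_gb} GB (fits: {shared_base_gb <= GPU_GB})")
```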

The deeper lesson: PEFT changes the deployment economics of fine-tuning. Per-customer customisation becomes affordable; multi-tenant serving becomes practical. The shared-base + swap-adapter pattern is what powers many “custom model” SaaS offerings today.

↳ §19.1 + production deployment

Next: §19.2 — LoRA. The full math behind why a rank-8 update matrix captures most of the fine-tuning signal, via Aghajanyan 2020’s intrinsic-dimensionality result.