FINE-TUNING, LORA, AND PEFT
Section 19.3

The PEFT family — adapters, prefix, IA³, prompt tuning

LoRA dominates production fine-tuning, but it’s one method in a family that emerged around 2019-2022 with a common theme: how do you adapt a frozen pretrained model with minimal new parameters? Each method makes a different bet about WHERE the adaptation should happen — in the residual stream (adapters), in the input embedding space (prefix tuning, prompt tuning), in the linear layer updates (LoRA), or in the activation scales (IA³). This section walks the major variants, compares their architectures and parameter counts, and explains why LoRA + QLoRA became the production default while the others occupy specific niches.

Adapters — bottleneck MLPs between blocks

Houlsby 2019 “Parameter-Efficient Transfer Learning for NLP” introduced the first PEFT method that worked: adapters.

Houlsby adapter (inserted in each transformer block):

  x_in → attention(LN(x_in)) → +x_in → ADAPTER → FFN(LN(...)) → +... → x_out
                                          ↓
  Adapter(x) = x → W_down (d_model × d_a) → GeLU → W_up (d_a × d_model) → + x (residual)

  Trainable parameters per adapter: 2 · d_model · d_a
  Inserted once per block (sometimes twice: after attention AND after FFN).
  d_a typically 64 or 128 (vs d_model = 4096).
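
The bottleneck structure above can be sketched in a few lines of plain Python. This is a minimal illustration, not the reference implementation: the helper names (`gelu`, `matvec`, `adapter`, `adapter_params`) and the toy sizes are assumptions made for the example, biases are omitted, and the zero-initialised `W_up` is one common way to make a fresh adapter start as the identity.

```python
import math

def gelu(v):
    # tanh approximation of GeLU, applied element-wise
    c = math.sqrt(2.0 / math.pi)
    return [0.5 * x * (1.0 + math.tanh(c * (x + 0.044715 * x ** 3))) for x in v]

def matvec(W, v):
    # W is stored as a list of rows; returns W @ v
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def adapter(x, W_down, W_up):
    # x -> W_down -> GeLU -> W_up -> + x (residual)
    h = gelu(matvec(W_down, x))
    up = matvec(W_up, h)
    return [xi + ui for xi, ui in zip(x, up)]

def adapter_params(d_model, d_a):
    # two weight matrices; biases ignored, as in the formula above
    return 2 * d_model * d_a

# Toy sizes (hypothetical): d_model = 4, d_a = 2
x = [1.0, -2.0, 0.5, 3.0]
W_down = [[0.1, 0.2, -0.1, 0.0],
          [0.3, -0.2, 0.1, 0.05]]          # stored as d_a rows of d_model entries
W_up_zero = [[0.0, 0.0] for _ in range(4)]  # stored as d_model rows of d_a entries

identity_out = adapter(x, W_down, W_up_zero)   # zero W_up: adapter is a no-op
per_adapter = adapter_params(4096, 64)         # the d_model and d_a quoted above
```

With `W_up` at zero the residual connection passes `x` through untouched, so training starts from the frozen model's behaviour and only gradually departs from it.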

Adapters are conceptually clean: they insert a small “tuneable block” into each transformer layer that can modify the residual stream as it flows through. The backbone never changes; only the adapters do.

The downside: adapters insert NEW computation that runs at inference time (you have to pass through the bottleneck MLP). This adds ~5-10% inference latency. LoRA’s “deltas can be merged into the original weight” trick eliminates this overhead — a key reason LoRA won.

Prefix tuning and prompt tuning — trainable soft tokens

Li 2021 “Prefix-Tuning” introduced a fundamentally different approach: train a sequence of soft tokens that are prepended to the input.

Prefix tuning:

  For each transformer block, prepend trainable "prefix" vectors to the K and V in attention (acting like "virtual tokens" that always appear at the start of the sequence).

  K = [K_prefix; K_actual]    K_prefix ∈ ℝ^{n_prefix × d_k}    TRAINABLE
  V = [V_prefix; V_actual]    V_prefix ∈ ℝ^{n_prefix × d_v}    TRAINABLE

  The Q (queries) are unchanged. The attention has the same form but with n_prefix additional keys+values that the queries can attend to.

  Trainable parameters: n_prefix · (d_k + d_v) per attention layer, where d_k and d_v are the full key and value widths across all heads.
  For Llama 7B (d_k = d_v = 4096) with n_prefix = 20: ~164K trainable params per layer × 32 layers ≈ 5.2M.

Prompt tuning (Lester 2021): simpler variant.

  - Only prepend trainable input embeddings to the input embedding sequence.
  - Each transformer block sees the prefix through standard attention; no separate K, V per block.
  - Trainable parameters: n_prefix · d_model ≈ 80K total.
  - Smaller and simpler than prefix tuning; the performance gap to heavier methods narrows as the base model gets larger.
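
To make the K = [K_prefix; K_actual] construction concrete, here is a small single-query attention sketch in plain Python. The function name `attend` and all numeric values are made up for illustration; in a real model the prefix rows would be learned by gradient descent, one set per layer.

```python
import math

def attend(q, keys, values):
    # Scaled dot-product attention for a single query vector.
    d_k = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
    m = max(scores)                                  # subtract max for stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    out = [sum(w * v[j] for w, v in zip(weights, values))
           for j in range(len(values[0]))]
    return out, weights

q = [1.0, 0.0]                          # query: unchanged by prefix tuning
K_actual = [[1.0, 0.0], [0.0, 1.0]]     # keys from the real tokens
V_actual = [[1.0, 0.0], [0.0, 1.0]]
K_prefix = [[2.0, 0.0]]                 # hypothetical trained prefix key
V_prefix = [[0.0, 5.0]]                 # hypothetical trained prefix value

base_out, base_w = attend(q, K_actual, V_actual)
pref_out, pref_w = attend(q, K_prefix + K_actual, V_prefix + V_actual)
```

The query and the real keys and values are identical in both calls; the only change is one extra row that every query can attend to, which is enough to shift the attention output.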

Prefix tuning and prompt tuning are the most “linguistic” of the PEFT methods — they’re literally “training the best prompt for the task” in continuous embedding space rather than discrete tokens. The model itself is unchanged; only the input is modified.

— think, then check —

Structural difference:

Prompt tuning: trains n_prefix vectors in the INPUT EMBEDDING space. These vectors are prepended to the input embeddings before any transformer block processes them. The standard transformer layers then operate on the modified embedding sequence — but the model itself is unchanged.

Trainable parameters: n_prefix · d_model. For Llama 7B with n_prefix = 20: 20 · 4096 = 80K parameters total.

Prefix tuning: trains separate K, V vectors at EVERY ATTENTION LAYER. The prefix is “inserted” into each layer’s attention computation as extra keys and values that all queries attend to. The trainable parameters per layer are 2 · n_prefix · d_k.

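Trainable parameters: n_prefix · 2 · d_k · n_layers, with d_k taken as the full key width across all heads (32 heads · 128 = 4096 for Llama 7B). With n_prefix = 20: 20 · 2 · 4096 · 32 ≈ 5.2M parameters. Counting only a single 128-dimensional head would give 20 · 2 · 128 · 32 ≈ 164K, which understates the total.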
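
The parameter counts are easy to check by hand, and the result depends on whether the key/value width means one head (128) or all heads together (4096). The sketch below computes both; the function names are illustrative, and the Llama 7B dimensions (d_model = 4096, 32 layers, 32 heads) are the ones this section already uses.

```python
def prompt_tuning_params(n_prefix, d_model):
    # one trainable embedding per soft token, at the input only
    return n_prefix * d_model

def prefix_tuning_params(n_prefix, kv_width, n_layers):
    # one trainable key row and one value row per soft token, per layer
    return 2 * n_prefix * kv_width * n_layers

n_prefix, d_model, n_layers, n_heads = 20, 4096, 32, 32
head_dim = d_model // n_heads

prompt_total = prompt_tuning_params(n_prefix, d_model)
prefix_one_head = prefix_tuning_params(n_prefix, head_dim, n_layers)
prefix_all_heads = prefix_tuning_params(n_prefix, d_model, n_layers)

print(prompt_total, prefix_one_head, prefix_all_heads)
```

Either way, prefix tuning carries more trainable parameters than prompt tuning, because it repeats the prefix at every layer instead of only at the input.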

Prefix tuning has more parameters and reaches deeper into the model.

Why prompt tuning works at 10B+:

At sufficient scale, the pretrained model has rich representational capacity. Prepending a few trained “soft tokens” gives the model enough latitude to adapt its behaviour for the task — the embedding space is expressive enough that a small change at the input can shift the model’s behaviour substantially.

At smaller scale, the pretrained model’s representations are tighter — small input changes don’t propagate to large behaviour changes. Prompt tuning struggles because the input-only adaptation isn’t enough to reshape the model’s effective behaviour.

Prefix tuning bridges the gap at smaller scales:

By injecting trainable K, V at every attention layer, prefix tuning gives the model multiple “leverage points” to shape behaviour — at every layer, the model has new “tokens” to attend to. This is more invasive than prompt tuning and works at sizes where prompt tuning fails.

Why both lost to LoRA:

LoRA modifies the WEIGHT MATRICES directly (in a low-rank way), a more expressive change than modifying the input or the K, V. In most reported comparisons LoRA matches or outperforms prefix and prompt tuning, although prompt tuning remains the smallest of the three in raw parameter count.

The remaining niche for soft tokens: composition. You can stack multiple learned prefixes (“be helpful” + “use formal language” + “answer in French”) — this combinatorial property is harder with LoRA, where adapters tend to interfere when combined.

IA³ — multiplicative activation scaling

Liu 2022 “Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning” introduced an even smaller variant: IA³.

IA³ scales:

  For each transformer block, learn three scale vectors:

    ℓ_K  ∈ ℝ^{d_k}      scales the keys                     (K_scaled = ℓ_K ⊙ K)
    ℓ_V  ∈ ℝ^{d_v}      scales the values                   (V_scaled = ℓ_V ⊙ V)
    ℓ_FF ∈ ℝ^{d_ffn}    scales the FFN hidden activation    (h_scaled = ℓ_FF ⊙ h)

  Each acts like a DIAGONAL matrix: just element-wise multiplications, no full matrix.

  Trainable parameters per layer: d_k + d_v + d_ffn ≈ d_model + d_model + 2.7 · d_model ≈ 5 · d_model
  For Llama 7B: 5 · 4096 · 32 layers ≈ 650K trainable parameters.
  This is roughly 6× fewer parameters than rank-8 LoRA applied to the query and value projections (~4.2M).
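
A minimal sketch of the scaling operation and its parameter count, in plain Python. The names `ia3_scale` and `ia3_params` and the "trained" scale values are hypothetical; the FFN width of 11008 is Llama 7B's, which is where the 2.7 · d_model figure comes from.

```python
def ia3_scale(activation, scale):
    # element-wise (Hadamard) product: scale ⊙ activation
    return [s * a for s, a in zip(scale, activation)]

def ia3_params(d_k, d_v, d_ffn, n_layers):
    # one scale vector each for keys, values and the FFN hidden activation
    return (d_k + d_v + d_ffn) * n_layers

k = [0.5, -1.0, 2.0]
l_k_init = [1.0, 1.0, 1.0]       # scales of one leave the model unchanged
l_k_trained = [1.0, 0.5, 3.0]    # hypothetical learned values

unchanged = ia3_scale(k, l_k_init)
reweighted = ia3_scale(k, l_k_trained)

exact_total = ia3_params(4096, 4096, 11008, 32)   # Llama 7B dimensions
approx_total = 5 * 4096 * 32                      # the 5 · d_model shortcut
```

The exact sum comes out slightly below the 5 · d_model approximation, since 4096 + 4096 + 11008 is about 4.7 · d_model. Starting the scales at one means training begins from the frozen model's exact behaviour.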

IA³’s bet: the most impactful adaptation is REWEIGHTING existing features, not introducing new directions. Multiplying the K, V activations by per-feature scales tells the model “pay more attention to feature 3 and less to feature 7” without changing what those features represent.

Empirically: IA³ works surprisingly well for small fine-tuning datasets (1K-10K examples), where its small parameter count is enough. For larger fine-tuning data (>50K examples), LoRA pulls ahead.

The LoRA variants

LoRA has spawned its own family of refinements:

These improvements are each 1-5% better than vanilla LoRA on specific benchmarks; none has fundamentally replaced the original.

— think, then check —

Houlsby adapter structure:

A small bottleneck MLP inserted INTO the residual stream of each transformer block:

x → W_down (d_model × d_a) → GeLU → W_up (d_a × d_model) → + x

The adapter REPLACES nothing; it ADDS a new computation path. Each block’s effective forward becomes: attention → +x → adapter → FFN → +x.

Trainable params per adapter: 2 · d_model · d_a. Two per block (after attention and after FFN). For Llama 7B, d_a = 64: 2 · 4096 · 64 ≈ 524K per adapter, ~1M per block, ~33.5M total across 32 blocks.

LoRA structure:

A low-rank delta added to LINEAR LAYERS:

W_eff = W_base + B · A

LoRA doesn’t add a new computation path; it modifies existing matmuls. The transformer block’s forward is unchanged in shape; only the weights are augmented.

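Trainable params per LoRA-applied layer: r · (d_in + d_out), which is 2 · d · r for a square d × d layer. For Llama 7B with r = 8 on 7 linear layers per block × 32 blocks: ~20M params (the three FFN projections are wider than d_model, so each costs more than 2 · d · r). Restricting LoRA to the query and value projections gives ~4.2M.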
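
The total depends strongly on which projections receive a LoRA delta, so it is worth computing explicitly. The sketch below assumes Llama 7B's shapes (d_model = 4096, FFN width 11008, 32 blocks) and counts only the A and B matrices; `lora_params` is an illustrative helper, not a library function.

```python
def lora_params(d_in, d_out, r):
    # A has shape r × d_in, B has shape d_out × r
    return r * (d_in + d_out)

d_model, d_ffn, n_layers, r = 4096, 11008, 32, 8

# Attention: query, key, value and output projections (all d_model × d_model)
attn = 4 * lora_params(d_model, d_model, r)
# FFN: gate and up (d_model -> d_ffn) plus down (d_ffn -> d_model)
mlp = 2 * lora_params(d_model, d_ffn, r) + lora_params(d_ffn, d_model, r)

all_linear = (attn + mlp) * n_layers
qv_only = 2 * lora_params(d_model, d_model, r) * n_layers

print(all_linear, qv_only)
```

Adapting all seven linear layers costs roughly five times as many trainable parameters as adapting only the query and value projections at the same rank.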

Inference cost difference:

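Adapters: the bottleneck MLP runs at INFERENCE TIME. Each adapter adds 2 · d_model · d_a multiply-accumulates per token. For Llama 7B with d_a = 64: 2 · 4096 · 64 · 2 (two adapters per block) · 32 layers ≈ 33.5M extra multiply-accumulates per token. That is well under 1% of the base model’s roughly 7B per-token multiply-accumulates, so the ~5-10% latency overhead comes less from raw compute than from the extra sequential layers, which cannot be merged away and hurt most at small batch sizes.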

LoRA: at TRAINING time, the LoRA delta is computed alongside the base. At INFERENCE time, you can MERGE B · A into W_base: W_merged = W_base + B · A. The merged matrix is the same shape as W_base; inference cost is identical to the original. Zero overhead.
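
The merge can be verified numerically on a toy example. The sketch below uses small integer matrices so the comparison is exact; the helper names are illustrative, and the usual LoRA scaling factor (alpha / r) is assumed to be already folded into B.

```python
def matmul(X, Y):
    # plain triple-loop matrix product
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def add(X, Y):
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]

W_base = [[1, 2, 0],
          [0, 1, 3],
          [4, 0, 1]]        # frozen d × d weight (d = 3)
B = [[1], [0], [2]]         # d × r, with r = 1
A = [[2, -1, 1]]            # r × d
x = [3, 1, -2]

# Training-time view: base path plus a separate low-rank path
unmerged = [w + d for w, d in zip(matvec(W_base, x), matvec(B, matvec(A, x)))]

# Deploy-time view: fold B · A into the weight once, then one matmul per call
W_merged = add(W_base, matmul(B, A))
merged = matvec(W_merged, x)
```

Both views give the same output, and `W_merged` has the same shape as `W_base`, which is why a merged model needs no changes to the inference code.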

This merging property is the KEY operational reason LoRA won. You fine-tune cheaply, then deploy with zero overhead. Adapters require modified inference code AND pay runtime cost forever.

What you give up with LoRA:

You can only apply LoRA to LINEAR LAYERS (since the update ΔW = B · A is a low-rank matrix added to an existing weight). Adapters can adapt arbitrary points in the residual stream and capture non-linear transformations. For tasks requiring genuinely new representations (rare in practice), adapters might be more expressive.

For 99% of fine-tuning tasks, LoRA’s expressivity is enough and the zero-overhead inference makes it the production default.

Why LoRA + QLoRA dominate

Production fine-tuning method share (rough 2024 estimate):

  Method                    Share    Use case
  LoRA + QLoRA              ~85%     Default for any custom fine-tune
  Full FT                   ~10%     Frontier labs only
  Feature extraction        ~3%      Sentence embedding models, classification heads
  IA³ / adapters            ~1%      Research, very small datasets
  Prefix / prompt tuning    ~1%      Composition use cases, soft prompt search

The case for LoRA + QLoRA, summarised:

  1. Cheap. ~0.5-2% of full-FT parameters trainable. Combined with QLoRA’s 4-bit base, fine-tuning a 70B model on a single 48GB GPU becomes feasible (the 4-bit weights alone take ~35GB, too much for a 24GB card).
  2. Fast. Hours, not days. ~10× faster training per epoch than full FT.
  3. Zero inference overhead. Merge B · A into W_base at deploy time; no runtime cost.
  4. Multi-tenant friendly. One frozen 4-bit base; swap LoRA adapters per customer / per task. Scales to thousands of customised behaviours from one base.
  5. Robust to overfitting. The rank constraint acts as a regulariser. Often outperforms full FT on small fine-tuning datasets.
  6. Tooling. Hugging Face PEFT library, plus every fine-tuning tutorial on the internet, defaults to LoRA. The ecosystem is mature.
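
Point 4 (one frozen base, many swappable adapters) can be sketched as follows. The class name `LoRALinear`, its methods, and the tenant names are hypothetical and do not correspond to any particular library's API. The sketch keeps the deltas unmerged so they can be switched per request, which trades the zero-overhead property of point 3 for flexibility.

```python
def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

class LoRALinear:
    """One frozen base weight shared by many small (B, A) adapter pairs."""

    def __init__(self, W_base):
        self.W_base = W_base       # never modified after construction
        self.adapters = {}         # adapter name -> (B, A)

    def register(self, name, B, A):
        self.adapters[name] = (B, A)

    def forward(self, x, adapter=None):
        y = matvec(self.W_base, x)
        if adapter is None:
            return y               # plain base-model behaviour
        B, A = self.adapters[adapter]
        delta = matvec(B, matvec(A, x))   # low-rank path: B (A x)
        return [a + b for a, b in zip(y, delta)]

layer = LoRALinear([[1, 0], [0, 1]])
layer.register("customer_a", B=[[1], [0]], A=[[1, 1]])
layer.register("customer_b", B=[[0], [2]], A=[[1, 0]])

x = [1, 2]
out_base = layer.forward(x)
out_a = layer.forward(x, adapter="customer_a")
out_b = layer.forward(x, adapter="customer_b")
```

Each tenant stores only its own small B and A matrices, while the large base weight is loaded once and shared by everyone.
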
— think, then check —

(a) Customer-support chatbot for 70B base, 5K examples:

Pick: QLoRA at rank 16-32.

Reasoning: 5K examples is a “small” dataset by LLM standards. QLoRA on a 4-bit base means the whole thing fits on a single 48 GB-class GPU (the 4-bit weights of a 70B model alone take ~35 GB). Rank 16-32 is enough for chat-style behavioural changes; lower rank might miss some style adaptations. The rank-acting-as-regulariser property helps on small data (less overfitting).

Why not full FT: 70B requires multi-GPU; 5K examples are too few to justify the cost; risk of catastrophic forgetting on small data.

Why not adapters: inference overhead would compound across millions of customer interactions.

Why not prefix tuning: works less well at 70B for behavioural adaptation; LoRA captures more.

(b) Sentence embedding model from 1B base:

Pick: Feature extraction + new pooling head.

Reasoning: sentence embedding is fundamentally a different task structure than autoregressive generation. The 1B base provides rich representations; you don’t need to ADAPT them — you need to EXTRACT them at a specific layer and pool them into a sentence vector.

The training task is contrastive (cosine similarity of paired vs unpaired sentences). The “head” is a small pooling layer (mean / [CLS] / weighted average) plus optional MLP. Training only the head is sufficient.
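
A minimal sketch of the pooling step and the similarity score that the contrastive objective works with, in plain Python. The token vectors below stand in for frozen backbone outputs and are made up for the example; `mean_pool` and `cosine` are illustrative helpers, and the padding mask is an assumption about how batches are represented.

```python
import math

def mean_pool(token_vectors, mask):
    # Average only over real tokens (mask == 1); padded positions are ignored.
    dim = len(token_vectors[0])
    n = sum(mask)
    return [sum(t[j] * m for t, m in zip(token_vectors, mask)) / n
            for j in range(dim)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical frozen-backbone outputs, one vector per token
tokens_a = [[1.0, 0.0], [3.0, 0.0], [9.0, 9.0]]   # last position is padding
mask_a = [1, 1, 0]
tokens_b = [[2.0, 0.0], [2.0, 0.0]]
mask_b = [1, 1]
tokens_c = [[0.0, 4.0], [0.0, 2.0]]
mask_c = [1, 1]

emb_a = mean_pool(tokens_a, mask_a)
emb_b = mean_pool(tokens_b, mask_b)
emb_c = mean_pool(tokens_c, mask_c)

sim_pair = cosine(emb_a, emb_b)      # a paired (similar) example
sim_unpaired = cosine(emb_a, emb_c)  # an unpaired example
```

A contrastive loss would push `sim_pair` up and `sim_unpaired` down. In the head-only setup described above, the gradients reach only the pooling weights and any optional MLP on top, not the backbone.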

Why not LoRA: LoRA modifies the generative behaviour, which isn’t what you want. The pretrained representations are already good; you just need to learn how to POOL them.

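Why not full FT: training the whole backbone is often overkill when the pretrained representations are already good, and head-only training is the cheapest baseline to try first. Note that widely used embedding models such as Sentence-BERT and E5 do fine-tune their backbones contrastively, so if head-only quality plateaus, unfreezing the backbone is the usual next step.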

(c) Composable persona fine-tune:

Pick: Prefix tuning (or modular LoRA variants like LoRAHub).

Reasoning: the goal is COMPOSITION — different personas should stack (“be helpful” + “use formal language” + “answer like a doctor”). LoRA adapters tend to interfere when stacked: the deltas don’t compose linearly because they affect overlapping weight matrices.

Prefix tuning is naturally composable: prepend “persona 1 prefix” + “persona 2 prefix” + … at the input. The model attends to all prefixes, weighing them by attention scores. Researchers have shown this composes more cleanly than stacked LoRAs.

Newer alternatives: LoRAHub (Huang 2023), which COMPOSES already-trained LoRA modules by learning a small set of mixing weights for them, and DoRA, which decomposes each weight into magnitude and direction components. But for a production system, prefix tuning’s structural composability is hard to beat.

The pattern: LoRA + QLoRA is the right default, but specific use cases (composition, embedding) have better-suited alternatives. The PEFT zoo isn’t obsolete; it’s specialised.

END OF CH.19 — Fine-tuning, LoRA, and PEFT.
§1 (three approaches: full FT, feature extraction, PEFT) · §2 (LoRA math, Aghajanyan intrinsic-dim, lora.c showing rank-4 sufficiency) · §3 (the PEFT family: adapters, prefix, prompt, IA³, and why LoRA + QLoRA dominate).

END OF PART IV — What Makes an LLM. Pretraining, MoE, alignment, fine-tuning — the full lifecycle of producing and adapting an LLM. From here we move to Part V (the systems that run them): hardware, runtimes, inference, training at scale, quantization (already done), vector search.