FINE-TUNING, LORA, AND PEFT
Section 19.2

LoRA — the math and the intrinsic-dim argument

The big LoRA question: can you fine-tune a 70-billion-parameter model by training only around 67 million parameters (a rank-8 update; the exact count depends on which layers get adapters)? Empirically, often yes: LoRA at rank 8-32 is frequently reported to match full fine-tuning on downstream benchmarks, and sometimes to exceed it. But WHY does this work? The architecture argument alone (low-rank matrices have fewer parameters) doesn’t explain why low rank is enough; there’s no a priori reason the “true” fine-tuning delta should be low-rank. The structural argument comes from Aghajanyan et al. 2020, “Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning”, which showed empirically that fine-tuning can succeed inside a very low-dimensional subspace of parameter space: on the order of hundreds to a few thousand dimensions, out of hundreds of millions of parameters or more. This section walks through the LoRA formulation, the intrinsic-dim experiment that motivates it, and a kernel that demonstrates rank-4 sufficiency on a synthetic task.

The formulation

Standard fine-tuning of a linear layer y = x·W updates W:

Standard fine-tuning:
    W_FT = W_base + ΔW,    ΔW ∈ ℝ^{d × d}  (or d_in × d_out)
    Trainable parameters: d² (the full ΔW).

LoRA fine-tuning:
    W_LoRA = W_base + B · A,    B ∈ ℝ^{d × r},  A ∈ ℝ^{r × d}
    Trainable parameters: 2 · d · r ≪ d² when r ≪ d.

Forward pass:
    y = x · W_LoRA = x · W_base  +  x · B · A
                     \_________/    \_______/
                     FROZEN base    LoRA contribution

At inference there are two options. Merge once (W_merged = W_base + B · A) and compute y = x · W_merged at exactly the cost of x · W. Or keep the adapter separate and compute y = x · W_base + (x · B) · A, which adds about 2 · d · r multiply-adds per token on top of the d² for the base.

LoRA exploits the assumption that the fine-tuning delta has low effective rank. This assumption is empirical (motivated by Aghajanyan et al. 2020, below), not a priori. It holds up well across a wide range of LoRA experiments, although gaps to full fine-tuning have been reported on some tasks.
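The formulation above can be sketched in a few lines of C. This is a minimal illustration, not the section's kernel: the function names (vec_mat, lora_forward, lora_merge, lora_params) are hypothetical, x is treated as a row vector, and B is stored as d_in × r and A as r × d_out to match y = x · W_base + x · B · A.

```c
#include <assert.h>

/* y[j] = sum_k x[k] * W[k*d_out + j]  (row vector times row-major matrix) */
static void vec_mat(const float *x, const float *W, int d_in, int d_out, float *y)
{
    for (int j = 0; j < d_out; j++) y[j] = 0.0f;
    for (int k = 0; k < d_in; k++)
        for (int j = 0; j < d_out; j++)
            y[j] += x[k] * W[k * d_out + j];
}

/* Unmerged LoRA forward: y = x*W_base + (x*B)*A.
   B is d_in x r, A is r x d_out; h is caller-provided scratch of length r. */
static void lora_forward(const float *x, const float *W_base,
                         const float *B, const float *A,
                         int d_in, int d_out, int r, float *h, float *y)
{
    assert(d_in > 0 && d_out > 0 && r > 0);
    vec_mat(x, W_base, d_in, d_out, y);      /* frozen path: d_in*d_out MACs */
    vec_mat(x, B, d_in, r, h);               /* down-projection: d_in*r MACs */
    for (int i = 0; i < r; i++)              /* up-projection: r*d_out MACs  */
        for (int j = 0; j < d_out; j++)
            y[j] += h[i] * A[i * d_out + j];
}

/* Fold the adapter into one matrix: W_merged = W_base + B*A. */
static void lora_merge(const float *W_base, const float *B, const float *A,
                       int d_in, int d_out, int r, float *W_merged)
{
    for (int k = 0; k < d_in; k++)
        for (int j = 0; j < d_out; j++) {
            float delta = 0.0f;
            for (int i = 0; i < r; i++) delta += B[k * r + i] * A[i * d_out + j];
            W_merged[k * d_out + j] = W_base[k * d_out + j] + delta;
        }
}

/* Trainable parameter counts for one layer. */
static long full_ft_params(int d_in, int d_out)     { return (long)d_in * d_out; }
static long lora_params(int d_in, int d_out, int r) { return (long)r * (d_in + d_out); }
```

Calling lora_forward and then vec_mat on the output of lora_merge gives the same y. The unmerged path costs r · (d_in + d_out) extra multiply-adds per input vector; the merged path costs the same as the base layer. For d_in = d_out = 4096 and r = 8, lora_params returns 65,536 against 16,777,216 for full_ft_params.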

The intrinsic-dim argument (Aghajanyan 2020)

The deeper “why does LoRA work” question is answered by Aghajanyan 2020. Their experimental setup:

Aghajanyan's intrinsic-dim experiment:

  1. Take a pretrained model with parameters θ ∈ ℝ^D (D ~ hundreds of millions to billions).
  2. Restrict fine-tuning to a d-dimensional subspace:
         θ_FT(t) = θ_base + P · t
     where P ∈ ℝ^{D × d} is a fixed random projection and t ∈ ℝ^d is the trainable vector.
  3. For various d, fine-tune on a target task and measure performance.
  4. Find the smallest d that recovers ≥90% of full fine-tuning performance.

Empirical finding: d* (the "intrinsic dimension") is much smaller than D.

    For BERT-base on text classification:   d* ≈ 200
    For BERT-large on text classification:  d* ≈ 700
    For GPT-2 on language modeling:         d* ≈ few thousand

Across the paper's comparisons, the intrinsic dimension tends to SHRINK as the pretrained model gets larger.

Intrinsic dimension formalises the intuition that fine-tuning explores a low-dimensional manifold. The result has two parts:

  1. The intrinsic dimension is small — usually orders of magnitude smaller than the model’s total parameter count.
  2. Larger models have smaller intrinsic dimensions — the more pretraining capacity, the less the fine-tuning needs to “move” the model.

The second part is the surprising one. You might expect larger models to need a larger update subspace; they actually need a SMALLER one. The pretraining has done more of the work; the fine-tuning just needs to point the model in the right direction.

— think, then check —

What the 700 number means:

If you constrain the fine-tuning update to lie in a random 700-dimensional subspace of the 340M-dimensional parameter space, you recover at least 90% of full fine-tuning’s task accuracy.

The “subspace” is defined by a random projection: θ_FT = θ_base + P · t, where P is a fixed random 340M × 700 matrix and t is a 700-dimensional trainable vector. You’re literally training only 700 numbers.

Result: the model fine-tunes to near-full performance using just those 700 trainable scalars. In other words, a fine-tuning delta good enough for this task can be found INSIDE a random 700-dim subspace of the full parameter space.
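The reparameterisation θ_FT = θ_base + P · t is easy to try on a toy problem. The sketch below is an illustration under stated assumptions, not the paper's code: the objective is a stand-in quadratic, P is a small dense matrix filled by a hypothetical LCG helper (large-scale experiments need a structured, memory-efficient projection), and all function names are made up for this example. The key step is the chain rule dL/dt = Pᵀ · dL/dθ, so only the d entries of t are ever updated.

```c
#include <assert.h>
#include <stdint.h>

/* Tiny deterministic generator (hypothetical helper): float in [-1, 1). */
static float lcg_uniform(uint32_t *state)
{
    *state = *state * 1664525u + 1013904223u;
    return (float)(*state >> 8) / 8388608.0f - 1.0f;   /* 24 bits / 2^23 - 1 */
}

/* Fill the fixed projection P (D x d, row-major) once; it is never trained. */
static void fill_random_projection(float *P, int D, int d, uint32_t seed)
{
    uint32_t state = seed;
    for (int i = 0; i < D * d; i++) P[i] = lcg_uniform(&state);
}

/* theta = theta_base + P * t */
static void subspace_expand(const float *theta_base, const float *P,
                            const float *t, int D, int d, float *theta)
{
    for (int i = 0; i < D; i++) {
        float acc = theta_base[i];
        for (int j = 0; j < d; j++) acc += P[i * d + j] * t[j];
        theta[i] = acc;
    }
}

/* Chain rule: dL/dt = P^T * dL/dtheta */
static void subspace_grad(const float *P, const float *grad_theta,
                          int D, int d, float *grad_t)
{
    for (int j = 0; j < d; j++) grad_t[j] = 0.0f;
    for (int i = 0; i < D; i++)
        for (int j = 0; j < d; j++)
            grad_t[j] += P[i * d + j] * grad_theta[i];
}

/* Toy objective: L(theta) = 0.5 * ||theta - target||^2, also writes dL/dtheta. */
static float toy_loss(const float *theta, const float *target, int D,
                      float *grad_theta)
{
    float loss = 0.0f;
    for (int i = 0; i < D; i++) {
        float diff = theta[i] - target[i];
        grad_theta[i] = diff;
        loss += 0.5f * diff * diff;
    }
    return loss;
}

/* Gradient descent on t only; theta_base and P stay fixed. Returns final loss.
   theta and grad_theta are scratch of length D, grad_t is scratch of length d. */
static float train_in_subspace(const float *theta_base, const float *P,
                               const float *target, int D, int d,
                               float lr, int steps, float *t,
                               float *theta, float *grad_theta, float *grad_t)
{
    assert(d > 0 && d <= D);
    for (int s = 0; s < steps; s++) {
        subspace_expand(theta_base, P, t, D, d, theta);
        toy_loss(theta, target, D, grad_theta);
        subspace_grad(P, grad_theta, D, d, grad_t);
        for (int j = 0; j < d; j++) t[j] -= lr * grad_t[j];
    }
    subspace_expand(theta_base, P, t, D, d, theta);
    return toy_loss(theta, target, D, grad_theta);
}
```

Sweeping d and recording the final loss for each value gives a toy version of the "smallest d that reaches 90%" measurement. With D = 2, d = 1 and P = (1, 0)ᵀ, for example, only the first coordinate can move, so the loss converges to whatever the second coordinate contributes.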

Why larger models have smaller intrinsic dim (surprising):

The intuition: at small scale, the pretrained model “knows” only a small amount; fine-tuning has to teach it new things, requiring many parameter changes. At large scale, the pretrained model already knows most of what’s needed; fine-tuning only has to STEER the existing knowledge.

Concretely: imagine the model’s parameter space as a landscape of “useful configurations.” At small scale, the configurations are spread thin; you have to MAKE many changes to find one good for your task. At large scale, the landscape has many densely-packed good configurations; you only need a SMALL perturbation to find one matching your task.

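Put geometrically: the set of good solutions reachable from the pretrained weights is richer at larger scale, so a solution for the new task lies closer to the starting point and a low-dimensional search is more likely to hit it. This is an intuition, not a proved statement.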

Implication for LoRA rank:

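If the intrinsic dimension of fine-tuning a 7B model is ~1000, and the model has, say, 200 linear layers each with d × d = 4096² ≈ 16.8M parameters, then the intrinsic dimension is spread across these layers. A natural distribution: ~5 dimensions of update per layer.

LoRA at rank r gives each d × d layer 2 · d · r trainable parameters (r · (d_in + d_out) in general), e.g. 2 · 4096 · 4 = 32,768 at r = 4. That is far more freedom than ~5 dimensions per layer, so r = 4-8 is plausibly enough. The comparison is heuristic: intrinsic dimension is measured with random subspaces of the whole parameter vector, whereas LoRA uses a structured low-rank subspace inside each layer.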

Empirically: LoRA at rank 8-16 is commonly reported to come close to full fine-tuning quality for 7B+ models on many benchmarks. The intrinsic-dim argument offers an explanation for why such small ranks can be enough.

As a rule of thumb, larger models (70B+) often do well with LoRA rank 4-8, while smaller models (1B-3B) may need rank 32-64. This is consistent with the prediction: larger pretrained model = lower intrinsic dim per layer = lower LoRA rank needed.

Now make it run

The kernel for this section trains a tiny linear layer two ways: (1) full fine-tuning of the whole d_in × d_out weight matrix, (2) LoRA at ranks 1, 2, 4, 8, 16. The “task delta” is constructed to have rank exactly 4, a toy stand-in for the low-dimensional structure that Aghajanyan et al. report.
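The train_lora listing that follows calls two helpers, normalf() and forward_mse(), whose definitions are not shown in this section. The sketch below shows what they could look like so the listing can be compiled and experimented with. Everything here is an assumption: the Irwin-Hall approximation for normalf, the reading of the last forward_mse argument as a "compute gradient" flag, and the 1/n loss normalisation are guesses, not the original code. Only D_IN = 32 and D_OUT = 16 come from the printed output.

```c
#include <assert.h>
#include <stdlib.h>

#define D_IN  32   /* matches d_in=32 in the printed output  */
#define D_OUT 16   /* matches d_out=16 in the printed output */

/* Approximate N(0, 1) sample: sum of 12 uniforms minus 6 (Irwin-Hall).
   Assumption: the original normalf() may use a different method. */
static float normalf(void)
{
    float s = 0.0f;
    for (int i = 0; i < 12; i++)
        s += (float)rand() / ((float)RAND_MAX + 1.0f);
    return s - 6.0f;
}

/* Squared error of X * W against Y, summed over outputs and averaged over
   the n samples. X is n x D_IN, Y is n x D_OUT, W is D_IN x D_OUT, all
   row-major. If want_grad is non-zero, gradW receives dLoss/dW.
   Assumption: flag meaning and normalisation are guesses. */
static float forward_mse(const float *X, const float *Y, const float *W,
                         float *gradW, int n, int want_grad)
{
    assert(n > 0);
    float loss = 0.0f;
    if (want_grad)
        for (int i = 0; i < D_IN * D_OUT; i++) gradW[i] = 0.0f;

    for (int s = 0; s < n; s++) {
        const float *x = X + s * D_IN;
        const float *y = Y + s * D_OUT;
        for (int j = 0; j < D_OUT; j++) {
            float pred = 0.0f;
            for (int k = 0; k < D_IN; k++) pred += x[k] * W[k * D_OUT + j];
            float err = pred - y[j];
            loss += err * err;
            if (want_grad)                       /* d(err^2)/dW[k][j] = 2*err*x[k] */
                for (int k = 0; k < D_IN; k++)
                    gradW[k * D_OUT + j] += 2.0f * err * x[k];
        }
    }
    if (want_grad)
        for (int i = 0; i < D_IN * D_OUT; i++) gradW[i] /= (float)n;
    return loss / (float)n;
}
```

With these two helpers plus definitions for N_TRAIN, N_EPOCHS and LR, the listing has everything it references. The absolute loss values will not match the printed output unless the original normalisation and data generation are the same.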

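lora.c: train_lora (C), LoRA vs full FT, rank sweep
/* Train LoRA: W_eff = W_base + (B · A)^T. Only A and B are updated.
   Needs <stdlib.h> (malloc, free) and <string.h> (memset).
   D_IN, D_OUT, N_TRAIN, N_EPOCHS, LR, normalf() and forward_mse() are
   defined elsewhere in the program and are not shown here. */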
static float train_lora(const float* X, const float* Y, const float* W_base,
                          int rank, float* W_eff_out)
{
    float* A = malloc(rank * D_IN  * sizeof(float));    /* A: r × d_in */
    float* B = malloc(D_OUT * rank * sizeof(float));    /* B: d_out × r */
    float* gradA = malloc(rank * D_IN  * sizeof(float));
    float* gradB = malloc(D_OUT * rank * sizeof(float));
    float* gradW_eff = malloc(D_IN * D_OUT * sizeof(float));
    float* W_eff = malloc(D_IN * D_OUT * sizeof(float));

    /* A initialised Gaussian; B initialised to ZERO (standard LoRA init).
       This ensures the LoRA delta starts at 0 — fine-tuning begins exactly at base. */
    for (int i = 0; i < rank * D_IN ; i++) A[i] = 0.1f * normalf();
    for (int i = 0; i < D_OUT * rank; i++) B[i] = 0.0f;

    float loss = 0;
    for (int e = 0; e < N_EPOCHS; e++) {
        /* Compute W_eff[k, j] = W_base[k, j] + sum_r B[j, r] · A[r, k] */
        for (int k = 0; k < D_IN; k++)
            for (int j = 0; j < D_OUT; j++) {
                float delta = 0;
                for (int r = 0; r < rank; r++) delta += B[j * rank + r] * A[r * D_IN + k];
                W_eff[k * D_OUT + j] = W_base[k * D_OUT + j] + delta;
            }

        loss = forward_mse(X, Y, W_eff, gradW_eff, N_TRAIN, 1);
        /* Chain rule:
           ∂L/∂A[r, k]  =  sum over j of  (∂L/∂W_eff[k, j]) · B[j, r]
           ∂L/∂B[j, r]  =  sum over k of  (∂L/∂W_eff[k, j]) · A[r, k]   */
        memset(gradA, 0, rank * D_IN  * sizeof(float));
        memset(gradB, 0, D_OUT * rank * sizeof(float));
        for (int k = 0; k < D_IN; k++)
            for (int j = 0; j < D_OUT; j++) {
                float g = gradW_eff[k * D_OUT + j];
                for (int r = 0; r < rank; r++) {
                    gradA[r * D_IN + k] += g * B[j * rank + r];
                    gradB[j * rank + r] += g * A[r * D_IN + k];
                }
            }
        for (int i = 0; i < rank * D_IN ; i++) A[i] -= LR * gradA[i];
        for (int i = 0; i < D_OUT * rank; i++) B[i] -= LR * gradB[i];
    }

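    /* Sketch of the function tail (the listing is cut off at this point):
       rebuild W_eff from the final A and B, since the copy computed inside the
       loop predates the last update, then write it out and release the buffers. */
    if (W_eff_out) {
        for (int k = 0; k < D_IN; k++)
            for (int j = 0; j < D_OUT; j++) {
                float delta = 0;
                for (int r = 0; r < rank; r++) delta += B[j * rank + r] * A[r * D_IN + k];
                W_eff_out[k * D_OUT + j] = W_base[k * D_OUT + j] + delta;
            }
    }
    free(A); free(B); free(gradA); free(gradB); free(gradW_eff); free(W_eff);
    return loss;   /* loss measured at the start of the last epoch */
}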

Output:

LoRA vs Full Fine-Tuning on a synthetic linear task
d_in=32, d_out=16, N=256, 200 epochs

scheme         rank   trainable params   loss after FT
base (no FT)   -                     0       134.76645
full FT        -                   512         3.98511
LoRA r=1       1                    48        77.36503
LoRA r=2       2                    96        28.79549
LoRA r=4       4                   192         0.05982
LoRA r=8       8                   384         0.04649
LoRA r=16      16                  768         0.04294

Two observations:

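  1. LoRA r=4 captures the task almost perfectly (loss 0.06), because the true task delta is rank 4. LoRA at the matching rank recovers the truth; ranks 1 and 2 cannot represent it and stall at much higher loss.
  2. LoRA r ≥ 4 BEATS full FT in this run (0.06 vs 3.99). The tempting explanation is regularisation: full FT has 512 trainable parameters while a rank-4 delta needs at most 192 (= 4 · (32+16)), so the rank constraint removes freedom that could fit noise. That cannot be the whole story, though: r=16 is full rank for a 32 × 16 matrix, has 768 parameters (more than full FT), and still reaches 0.043. A gap this large more plausibly reflects optimisation under the fixed learning rate and 200-epoch budget, with full FT not yet converged, than overfitting on N=256. The full-FT loop is not shown here, so treat this as a reading of the numbers.

The practical takeaway is more modest than “LoRA beats full FT”: on small fine-tuning datasets the rank constraint can act as a useful regulariser, and LoRA is often competitive with full FT at a fraction of the trainable parameters. Which one wins depends on the task, the data size, and the tuning.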

— think, then check —

LoRA forward pass:

y = x · W_base + x · (B · A)^T where:

  • W_base ∈ ℝ^(d_in × d_out): the original weight, FROZEN.
  • A ∈ ℝ^(r × d_in): first low-rank matrix, TRAINABLE.
  • B ∈ ℝ^(d_out × r): second low-rank matrix, TRAINABLE.
  • r ≪ d_in, d_out: the LoRA rank (typically 4-64).

The effective weight is W_eff = W_base + (B · A)^T. The product B · A has shape d_out × d_in, so it is transposed to match the d_in × d_out layout of W_base used here; this is exactly what the kernel's index expression B[j, r] · A[r, k] computes. (Ordering conventions vary: many papers and libraries store W as d_out × d_in and write W_eff = W_base + B · A directly. The rank-r factorisation is what matters.)

Trainable parameter count:

Full FT: d_in · d_out parameters (the entire W).

LoRA: r · d_in + d_out · r = r · (d_in + d_out) parameters.

Ratio: r · (d_in + d_out) / (d_in · d_out). For d_in = d_out = d:

Ratio = 2r/d.

For d = 4096 (Llama 2 7B) and r = 8: ratio = 16/4096 = 0.004 ≈ 0.4%. LoRA uses 0.4% of the parameters that full FT uses for that linear layer.

Per-Llama-block savings:

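Each attention block has 4 linear layers (W_Q, W_K, W_V, W_O); each FFN has 3 (gate, up, down). The original LoRA paper adapted only attention projections, but current practice often applies LoRA to all seven (Q+K+V+O+gate+up+down). The per-layer fractions do not add up across layers: if every layer trains about 0.3-0.4% of its own weights, the block as a whole also trains about 0.3-0.4% of its weights. The FFN layers in Llama 2 7B are 4096 × 11008, so their ratio is a little lower than the attention layers': 8 · (4096 + 11008) / (4096 · 11008) ≈ 0.27%.

For Llama 2 7B (6.7B params, 32 blocks), rank 8 on all seven layers gives 32 · (4 · 65,536 + 3 · 120,832) = 19,988,480 ≈ 20M trainable parameters, versus 6.7B for full FT. That is about 0.3% of the model, or over 300× fewer trainable params.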
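The per-layer and per-model counts are simple enough to compute directly, which is a useful sanity check on any quoted figure. A minimal sketch, assuming the published Llama 2 7B shapes (32 blocks, hidden size 4096, FFN width 11008) and LoRA on all seven projections per block; the function names are hypothetical:

```c
#include <assert.h>

/* Trainable parameters LoRA adds to one d_in x d_out linear layer. */
static long lora_layer_params(long d_in, long d_out, long r)
{
    assert(d_in > 0 && d_out > 0 && r > 0);
    return r * (d_in + d_out);
}

/* One decoder block: 4 attention projections (hidden x hidden) and
   3 FFN projections (hidden x ffn, or ffn x hidden for the down layer;
   the LoRA count r*(d_in + d_out) is the same either way). */
static long lora_block_params(long hidden, long ffn, long r)
{
    long attn = 4 * lora_layer_params(hidden, hidden, r);
    long mlp  = 3 * lora_layer_params(hidden, ffn, r);
    return attn + mlp;
}

/* Whole model: n_blocks identical blocks (embeddings and norms not adapted). */
static long lora_model_params(long n_blocks, long hidden, long ffn, long r)
{
    return n_blocks * lora_block_params(hidden, ffn, r);
}
```

For example, lora_model_params(32, 4096, 11008, 8) evaluates to 19,988,480, roughly 0.3% of the model's 6.7B weights. Doubling the rank doubles the count, since every term is linear in r.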

Initialisation — why B starts at zero

A small but important detail in the LoRA paper: B is initialised to zero, and A is initialised to a standard random distribution (e.g., Gaussian or Kaiming).

LoRA initialisation:
    A ~ Gaussian(0, σ²)   (or Kaiming uniform)
    B = 0                 (exactly zero)

At step 0:   B · A = 0 · A = 0, so the LoRA contribution is zero.
             The model is EXACTLY the base.
At step 1+:  B has been updated by the gradient, so B · A ≠ 0.
             The LoRA delta starts at zero and grows during training.

The motivation: at initialisation, you want the model to behave EXACTLY like the base. If both A and B were random, the LoRA delta B · A would be a random perturbation at step 0 — degrading the base model before any training. By setting B = 0, the LoRA delta is exactly zero at initialisation, and the model behaves identically to the base until the first gradient step. Fine-tuning then “discovers” the delta from a zero starting point.

— think, then check —

The asymmetric init:

A: standard random init (Gaussian or Kaiming).

B: exactly zero.

Result at step 0: B · A = 0 · A = 0. The LoRA contribution to the forward pass is zero. The model is EXACTLY the base.

Why this matters:

The first few gradient steps are critical. If the model’s behaviour at step 0 is already degraded (which random init of both A and B would cause), the initial gradients are computed from a “broken” starting point. Fine-tuning then has to first repair the degradation, then learn the task. The repair phase is wasted compute and can destabilise training.

Starting from B = 0 means: step 0 forward pass = base model. Step 1 forward pass = base + tiny correction. Fine-tuning gradually grows the correction in the right direction.

What goes wrong with symmetric small random init:

If A and B are both ~N(0, σ²) with small σ:

  • B · A has entries of typical size √r · σ² (each entry is a sum of r independent products, each with variance σ⁴): small but non-zero.
  • The LoRA contribution at step 0 is a random perturbation to the base.
  • The base’s pretrained representations are perturbed before any learning happens.
  • Initial loss is higher than the base model’s loss; fine-tuning has to recover the base, THEN learn the task.
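A short variance calculation makes the size of that step-0 perturbation concrete. It assumes the entries of A and B are independent, zero-mean, and share the same variance σ²; with different variances for the two matrices the same argument gives √r · σ_A · σ_B.

```latex
\begin{align*}
(BA)_{ij} &= \sum_{k=1}^{r} B_{ik} A_{kj},
  \qquad B_{ik},\, A_{kj} \ \text{independent},\ \ \mathbb{E}[\cdot] = 0,\ \ \operatorname{Var}[\cdot] = \sigma^2 \\
\mathbb{E}\big[(BA)_{ij}\big] &= \sum_{k=1}^{r} \mathbb{E}[B_{ik}]\,\mathbb{E}[A_{kj}] = 0 \\
\operatorname{Var}\big[(BA)_{ij}\big] &= \sum_{k=1}^{r} \mathbb{E}[B_{ik}^2]\,\mathbb{E}[A_{kj}^2] = r\,\sigma^4 \\
\Rightarrow\quad \operatorname{std}\big[(BA)_{ij}\big] &= \sqrt{r}\,\sigma^2
\end{align*}
```

The cross terms vanish because distinct products are uncorrelated. As an illustration, r = 8 and σ = 0.02 give a typical entry of about 2.83 · 0.0004 ≈ 0.0011. Each entry is tiny, but the perturbation is applied to every weight of every adapted layer at once.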

Why not initialise both to zero:

If both A and B are zero, gradients are also zero:

∂L/∂A ∝ B = 0; ∂L/∂B ∝ A = 0.

The LoRA module is STUCK at zero and can’t learn. A standard problem in neural net init.

The fix: ASYMMETRIC init.

A ~ Gaussian (provides non-zero “directions” for B to explore).

B = 0 (ensures forward pass starts identical to base).

Gradient at step 0: ∂L/∂A ∝ B = 0 (A doesn’t move at step 0). ∂L/∂B = (∂L/∂W_eff) · A^T ≠ 0 (B starts learning immediately).

After the first update, B has non-zero entries, so ∂L/∂A is no longer zero and A starts learning too. Both matrices then co-evolve.
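The gradient argument can be checked numerically. The sketch below reuses the kernel's layout (A is rank × d_in, B is d_out × rank, and W_eff[k][j] = W_base[k][j] + Σ_r B[j][r] · A[r][k]); the function names lora_grads and all_zero are hypothetical, and gradW stands for whatever upstream gradient ∂L/∂W_eff the loss produces.

```c
#include <assert.h>

/* Backprop through W_eff[k][j] = W_base[k][j] + sum_r B[j][r] * A[r][k].
   gradW is d_in x d_out, A is rank x d_in, B is d_out x rank (row-major). */
static void lora_grads(const float *gradW, const float *A, const float *B,
                       int d_in, int d_out, int rank,
                       float *gradA, float *gradB)
{
    assert(d_in > 0 && d_out > 0 && rank > 0);
    for (int i = 0; i < rank * d_in; i++)  gradA[i] = 0.0f;
    for (int i = 0; i < d_out * rank; i++) gradB[i] = 0.0f;
    for (int k = 0; k < d_in; k++)
        for (int j = 0; j < d_out; j++) {
            float g = gradW[k * d_out + j];
            for (int r = 0; r < rank; r++) {
                gradA[r * d_in + k] += g * B[j * rank + r];  /* scaled by B */
                gradB[j * rank + r] += g * A[r * d_in + k];  /* scaled by A */
            }
        }
}

/* Returns 1 if every entry of v is exactly zero, else 0. */
static int all_zero(const float *v, int n)
{
    for (int i = 0; i < n; i++)
        if (v[i] != 0.0f) return 0;
    return 1;
}
```

With B = 0 and a random A, lora_grads leaves gradA at zero while gradB is non-zero, so B moves first. With both A and B at zero, both gradients are zero and no update can ever change that, which is the "stuck" case described above.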

The asymmetric init is one of those “small detail that matters” tricks in modern deep learning: the LoRA paper specifies it (random Gaussian A, zero B), and it remains the default in widely used implementations.

Next: §19.3 — The PEFT family. Adapters (Houlsby 2019), prefix / prompt tuning, IA³, and the LoRA variants (LoRA+, DoRA, AdaLoRA). How LoRA + QLoRA came to dominate.