ALIGNMENT: RLHF → DPO → GRPO
Section 18.1
01

SFT + RLHF — the classical alignment pipeline

Pretraining produces a next-token predictor — given a prefix, it outputs the most-likely continuation. That’s not what users want. A user asking “What’s the capital of France?” doesn’t want the most-statistically-likely web continuation (“…is a common geography quiz question. The answer is Paris…”). They want a direct, useful, polite answer (“The capital of France is Paris.”). The transformation from “likely text predictor” to “useful assistant” is alignment, and the canonical recipe from Ouyang 2022 (“Training Language Models to Follow Instructions”) — the InstructGPT paper — became the template for ChatGPT, Claude, Gemini, and every other deployed assistant. This section walks the three stages: SFT, reward modeling, and PPO with KL constraint. §18.2 derives the closed-form DPO simplification that replaced this whole pipeline for many use cases.

Stage 1 — Supervised Fine-Tuning (SFT)

The pretrained base model has seen the entire internet but has never seen the format “user prompt → assistant response.” SFT teaches the format.

SFT data: a dataset of (prompt, response) pairs, where the response is a high-quality demonstration of what an assistant should say. Example: prompt: "What is the capital of France?" response: "The capital of France is Paris." SFT loss: standard cross-entropy on the response tokens only. For each example (x_prompt, y_response): L_SFT = - Σ_{t in response tokens} log π_θ(y_t | y_{<t}, x_prompt) This is identical to pretraining's NTL loss, but the gradient only flows through the RESPONSE tokens — the prompt is "context" that doesn't get supervised.

SFT is fast and cheap: 1-3 epochs on a curated dataset of 10K-1M demonstrations. The output is a model that “knows it’s an assistant” — it produces responses in the right format, but quality is still uneven (hallucinations, harmful answers, mediocre helpfulness).

A high-quality SFT dataset costs a lot to build. InstructGPT used ~13K human-written demonstrations. Llama 2-chat used ~27K. Modern open SFT datasets (UltraChat, OpenOrca) use 100K-1M examples, mostly generated by sampling existing strong assistants — a form of distillation.

Stage 2 — Reward Model (RM)

SFT produces a model that responds in the right format but doesn’t know “what response is better.” For that, you need a notion of quality the model can optimize against.

Reward modeling data: preference comparisons. For each example: a prompt x, two responses y₁ and y₂, and a label saying which response is preferred by a human annotator (y_w = "winning", y_l = "losing"). Bradley-Terry model: assume preferences arise from a latent reward function r(x, y): P(y_w preferred over y_l) = σ(r(x, y_w) - r(x, y_l)) where σ is the sigmoid. Reward model: train a neural net r_φ(x, y) to predict this probability via maximum likelihood: L_RM = - E_{(x, y_w, y_l)} [ log σ(r_φ(x, y_w) - r_φ(x, y_l)) ] The trained r_φ scores any (prompt, response) pair on a real number. Architecturally: a transformer (often same as SFT base) with a scalar output head.

The Bradley-Terry model is the canonical statistical framework for preferences. It assumes preferences are noisy reflections of an underlying real-valued “quality” score; the probability that A beats B in a comparison is the sigmoid of the score difference.

Reward model training is operationally simple: same architecture as the LLM, but replace the unembedding with a 1-output linear head. Train on ~50K-1M preference pairs. The result is a function that scores any (x, y) pair on a real number; higher = more preferred.

Stage 3 — PPO with KL constraint

Now the heavy lifting: use the reward model to fine-tune the SFT model via reinforcement learning, while preventing it from “drifting” too far from the SFT reference.

PPO objective for RLHF: L_PPO = E_{x, y ∼ π_θ} [ r_φ(x, y) - β · KL(π_θ(· | x) ‖ π_ref(· | x)) ] where: π_θ = policy being trained (initialised from SFT) π_ref = reference policy = SFT model (frozen) r_φ = reward model (frozen) β = KL coefficient (controls how much the policy can deviate from ref) In practice, the KL is computed token-level and rolled into the reward: r_total(x, y_t) = r_φ(x, y_full) - β · log [π_θ(y_t | y_{<t}, x) / π_ref(y_t | y_{<t}, x)] Then standard PPO loss with clipping: L_PPO_clipped = E [ min(ρ_t · A_t, clip(ρ_t, 1-ε, 1+ε) · A_t) ] where ρ_t = π_θ(y_t)/π_old(y_t) is the importance-sampling ratio and A_t is the advantage (generalised advantage estimation from the value head).

The reward + KL formulation has a clean interpretation: maximise reward, but pay a penalty proportional to how much you’ve drifted from the SFT reference. The KL term is what prevents reward hacking — the policy finding exploits in the reward model rather than producing genuinely good outputs.

— think, then check —

Same formula: L = -Σ log π(y_t | y_prior, x). Difference is in WHAT it’s applied to.

Difference 1 — Data distribution:

Pretraining: massive corpus of internet text, books, code. Distribution = “things that exist on the internet.”

SFT: small curated dataset of high-quality (prompt, response) pairs. Distribution = “things you’d want an assistant to do.”

Difference 2 — Gradient mask:

Pretraining: gradient flows through ALL tokens. Every token is both context and supervision.

SFT: gradient flows through RESPONSE tokens only. Prompt tokens are pure context. This focuses learning on “how to respond” rather than “what prompts look like.”

Difference 3 — Format structure:

SFT examples have a strict template: system message (optional), user prompt, assistant response, special tokens delimiting each. The model learns this format becomes the structure of “an assistant conversation.”

Pretraining has no template — it’s just continuous text.

The model emerging from SFT knows “this is the response part; here’s how a good response looks; the prompt is context I should answer.” Pretraining alone produces a model that “continues text.”

— think, then check —

Why comparisons, not scores:

Humans are bad at absolute scoring (asking “rate this response 1-10” produces inconsistent annotations) but good at relative comparison (asking “which is better, A or B?” produces reliable annotations).

Bradley-Terry leverages this: model preference P(A > B) = σ(r_A - r_B). The reward function is identified up to comparisons, not absolute values.

Consequence for interpretation:

Only DIFFERENCES of reward are meaningful. r(x, y₁) = 5.7 and r(x, y₂) = 5.5 means y₁ is preferred to y₂. r(x, y₁) = 5.7 alone means nothing — it’s the same as if all rewards were 1000 higher.

The reward function has a free constant (you can add any C to r and the Bradley-Terry probability is unchanged).

The reward function may not be well-calibrated as a “quality score” in any absolute sense — only as a comparative measure between responses to the same prompt.

How the reward model is used:

In PPO: the reward is added to each generated trajectory’s return. For two trajectories, the reward DIFFERENCE drives the gradient signal. This is fine — the relative reward is what matters for picking better continuations.

In DPO (§18.2): the reward model is BYPASSED entirely. The Bradley-Terry preference structure is folded into the policy loss directly. The reward function exists implicitly inside the policy itself.

In RLAIF: the reward model is replaced by a stronger LLM acting as a judge (e.g., Claude or GPT-4 scoring response pairs). Same Bradley-Terry math, different label source.

Reward model quality limits RLHF quality:

A reward model trained on 10K mediocre comparisons will have a noisy, biased preference function. Policy optimisation against a noisy reward will exploit the noise (reward hacking). Modern best practice: train RM on 100K+ high-quality comparisons; use ensembles of RMs to reduce single-model bias; validate the RM’s predictions against held-out human comparisons.

The PPO pipeline in production

Production RLHF training (one full cycle): while not converged: # 1. Sample Sample batch of prompts x ~ D_prompts. For each: generate response y ~ π_θ(· | x). ← rollout # 2. Score For each (x, y): compute r_full = r_φ(x, y). ← reward model forward pass # 3. Per-token reward with KL For each token t in response: r_t = (β · log[π_ref(y_t)/π_θ(y_t)]) ← per-token KL r_T = r_full at last token, summed with terminal # 4. Compute advantages (GAE) Run value network V_ψ(x, y_{<=t}) to estimate baseline. Advantage A_t = r_t + γ·V(t+1) - V(t) + GAE smoothing. # 5. PPO update For multiple epochs over the rollout batch: ratio = π_θ(y_t)/π_old(y_t) clipped = clip(ratio, 1-ε, 1+ε) · A_t loss = -mean(min(ratio · A_t, clipped)) + c_v · V_loss - c_h · H[π_θ]

Operational complexity:

— think, then check —

Component 1 — Clipped policy gradient (the main term):

Maximises E[A_t · log π_θ(y_t)] with clipping: ratio ρ_t = π_θ/π_old, clipped to [1-ε, 1+ε], loss = min(ρ_t · A_t, clip(ρ_t) · A_t).

Without it: vanilla policy gradient. Updates can be too large in one step, causing the policy to diverge. PPO clipping prevents huge per-step changes.

Component 2 — Value loss:

V_ψ is trained to predict the expected return from each state. Loss: (V_ψ - R_actual)². The value is used as a baseline in the advantage A = R - V — reducing variance of the policy gradient.

Without it: pure policy gradient with no baseline. Much higher gradient variance; training is noisy and slow. The value head is what makes PPO tractable.

Component 3 — Entropy bonus:

Adds -c_h · H[π_θ] to the loss (which subtracts -c · H, meaning ADDS c · H — encouraging higher entropy).

Without it: the policy can converge to deterministic (zero-entropy) very quickly, then can’t explore. The entropy bonus keeps the policy stochastic enough to discover better trajectories.

The KL penalty (token-level reward shaping):

r_token_t = … - β · log[π_θ(y_t)/π_ref(y_t)]

This is added to the per-token reward, NOT to the loss. It says: every time the policy diverges from the SFT reference, pay a per-token cost proportional to the log-ratio.

Without it: the policy can drift arbitrarily far from SFT, hacking the reward model. β=0 → uncontrolled exploration of weird policies.

Too large β: policy can’t move from SFT at all; RLHF has no effect.

Production tuning hurdles:

  1. β (KL coefficient): too small → reward hacking; too large → no learning. Typical 0.01-0.1, sometimes adaptive (KL controller).
  2. ε (PPO clip): typically 0.1-0.2. Too small → slow learning; too large → instability.
  3. Value loss coefficient c_v: 0.5-1.0. Determines how aggressively the value network is updated. Mismatched can cause training collapse.
  4. Entropy coefficient c_h: 0.001-0.01. Too small → premature determinism; too large → policy fails to commit.
  5. Reward scale: raw reward model outputs can be huge or tiny; PPO assumes well-scaled advantages. Usually normalise by running statistics.
  6. Sampling temperature: rollouts at temperature 0.7-1.0; too low → no exploration; too high → unrealistic distribution.
  7. Batch size / rollout length: trade-off between sample efficiency and gradient noise. RLHF papers use surprisingly large batches (millions of tokens per PPO step).

Why DPO replaced this for many use cases:

The PPO pipeline has ~7 hyperparameters and 4 networks; DPO has 1 hyperparameter and 2 networks. DPO is mathematically equivalent (under Bradley-Terry assumptions) and operationally vastly simpler. §18.2 derives the equivalence.

Next: §18.2 — Direct Preference Optimization (DPO). The closed-form solution to the PPO problem that lets you fine-tune on preference pairs DIRECTLY, with no reward model and no RL loop. Rafailov 2023.