Turning a next-token predictor into an assistant. Reward models, preference optimization, the modern simplifications.
A pretrained LLM predicts likely text. An assistant must answer USEFULLY, HELPFULLY, and SAFELY. Turning the former into the latter is "alignment", and the canonical pipeline since InstructGPT (Ouyang 2022) has three stages: supervised fine-tuning (SFT) on instruction-response pairs, a reward model trained on human preference comparisons (Bradley-Terry), then PPO against that reward model, with a KL penalty that keeps the policy close to the SFT reference. This section walks through each stage with its math and its production reality.
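To make stages two and three concrete, here is a minimal sketch using scalar values rather than real model outputs. The function names and the default `beta=0.1` are illustrative assumptions, not taken from InstructGPT; in practice the rewards and log-probabilities would come from neural networks and be batched.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    # Branch on the sign so exp() never receives a large positive argument.
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def bradley_terry_loss(r_chosen: float, r_rejected: float) -> float:
    """Reward-model loss for one comparison.

    Bradley-Terry models P(chosen beats rejected) = sigmoid(r_chosen - r_rejected);
    the loss is the negative log-likelihood of the human label.
    """
    return -log_sigmoid(r_chosen - r_rejected)


def kl_penalized_reward(reward: float, logp_policy: float, logp_ref: float,
                        beta: float = 0.1) -> float:
    """Reward the RL stage maximizes: reward model score minus a KL penalty.

    (logp_policy - logp_ref) is a single-sample estimate of the KL divergence
    between the policy and the SFT reference. beta is an assumed coefficient.
    """
    return reward - beta * (logp_policy - logp_ref)
```

With equal rewards the Bradley-Terry loss is log 2 (the model is undecided), and it shrinks as the chosen response's reward pulls ahead of the rejected one.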
Rafailov 2023 (DPO) made an elegant observation: the OPTIMAL policy under KL-constrained RLHF has a closed-form expression in terms of the reward function. Inverting that expression writes the reward in terms of the policy, and substituting it into the Bradley-Terry reward-modeling loss eliminates the reward model AND the RL loop entirely. What remains is a simple supervised loss on preference pairs: no rollouts, no value head, no clipping, and no separate KL penalty to track, because the constraint is folded into the loss through one coefficient β and the reference model's log-probabilities. In the paper's experiments DPO matched or beat PPO-based RLHF, and it has become a common default alignment method outside of frontier labs.
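The substitution can be sketched in a few lines. The optimal policy satisfies π*(y|x) ∝ π_ref(y|x)·exp(r(x,y)/β), so r(x,y) = β·log(π*(y|x)/π_ref(y|x)) plus a term that depends only on the prompt; that term cancels in the Bradley-Terry difference. The sketch below assumes the inputs are summed sequence log-probabilities of each response, and the function name and `beta=0.1` default are illustrative choices.

```python
import math


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """DPO loss for a single preference pair.

    Each argument is log pi(y | x) for the whole response, under either the
    trainable policy or the frozen reference model.
    """
    # Implicit rewards (up to a prompt-only constant): beta * log(pi / pi_ref).
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)

    # -log(sigmoid(margin)), computed stably for either sign of margin.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy equals the reference, the margin is zero and the loss is log 2. Training lowers the loss by raising the chosen response's log-ratio relative to the rejected one.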
After DPO, the alignment landscape splintered. GRPO (DeepSeek 2024) keeps PPO's on-policy benefits but drops the value head, instead computing group-relative advantages from K sampled outputs per prompt. RLAIF (Constitutional AI, Bai 2022) replaces human raters with a strong LLM as judge. Newer variants (SimPO, IPO, KTO, ORPO) each modify the preference-optimization loss for a slightly different objective. The 2025 picture is "DPO/GRPO for the bulk of training + custom losses for specific failure modes."
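The group-relative advantage is the piece that replaces the value head: each sampled output is scored against the other outputs for the same prompt. A minimal sketch, assuming the common mean-and-standard-deviation normalization; the use of the population standard deviation and the `eps` guard are assumptions here, and implementations differ on both.

```python
import statistics


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Advantages for K outputs sampled from the same prompt.

    Each reward is centered on the group mean and scaled by the group standard
    deviation, so no learned value function is needed as a baseline.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumed choice
    # eps avoids division by zero when every output gets the same reward.
    return [(r - mean) / (std + eps) for r in rewards]
```

For rewards `[1.0, 0.0, 0.0, 1.0]` the advantages come out at approximately `[1, -1, -1, 1]`. If all K outputs receive the same reward, every advantage is zero and that prompt contributes no gradient signal.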