REASONING MODELS
Section 20.1
01

The shift — sample more vs think more

For ten years the only knob anyone could turn for “more LLM quality” was TRAINING — more parameters, more data, better recipe. Inference cost stayed essentially fixed: one forward pass per generated token, decode latency ~30 ms per token, done. OpenAI o1 (September 2024) and DeepSeek R1 (January 2025) cracked open a second knob: test-time compute. Spend 10-100× more inference per question — model writes a long chain-of-thought, samples multiple solutions, checks them — and quality reliably goes up. Math accuracy jumped from ~50% to ~95% on AIME. Coding accuracy jumped from ~25% to ~75% on Codeforces-hard. Per-query cost went 30-100×; for the kind of problems users were willing to pay $0.10/query for, that was fine. This section walks the shift, the scaling-law form, and the trade-off that decides whether it works for a given task.

Two axes of compute

Pre-2024 mental model (Chinchilla era): total cost = TRAINING cost + inference cost × (lifetime queries) TRAINING cost: scale via (N, D) — bigger model, more tokens. INFERENCE cost: ~constant per generated token. Quality tied to N. 2024+ mental model (reasoning era): total cost = TRAINING cost + (per-query inference cost) × queries TRAINING cost: now includes RL on verifiable rewards (next section). INFERENCE cost: VARIABLE. The model can "think" for 100 tokens or 100,000. Per-query cost scales with how hard you push it. Quality tied to BOTH N (model size) and B (thinking budget).

Test-time compute (TTC) is the new lever. It exists because some problems are easier to VERIFY than to SOLVE (Bryson’s lemma in computational complexity: verification can be polynomial while solution requires exponential search). Math, code, formal proofs — the verifier is cheap, so sampling many attempts and keeping the right one converges to high accuracy at the cost of inference compute.

The scaling law shape

Empirical test-time compute scaling law (rough): accuracy(B) = E + A · log(B + 1) up to a task-dependent ceiling where B is the inference compute budget (tokens of thinking + parallel samples). For math (AIME 2024) on a 32B-class reasoning model: B = 1× standard: ~50% accuracy B = 10×: ~75% accuracy B = 100×: ~92% accuracy B = 1000×: ~96% accuracy (ceiling) For coding (LiveCodeBench): B = 1×: ~28% pass@1 B = 10×: ~58% B = 100×: ~75% The curve is LOG-LINEAR in B until you hit the model's capability ceiling. Adding TTC past the ceiling does nothing — the model genuinely can't solve problems beyond its base capability, no matter how many samples.

This is the “reasoning era” version of the Chinchilla scaling law. The economics are completely different: instead of paying $10M to train a slightly better model, you pay 10× more per query at inference. Whether that’s a win depends on (a) how many queries the model serves over its lifetime, and (b) whether your problem has a checkable answer.

Now make it run

The kernel from this section simulates the test-time compute trade-off on a synthetic multiple-choice task. Each problem has a known correct answer; the model’s per-attempt accuracy varies by difficulty. We sweep N (number of attempts) and measure pass@N (oracle verifier) vs majority@N (no verifier, consensus voting):

test_time_compute.c — main C · best-of-N with and without verifier
       +1.5 ≈ shifts a 30% problem to ~65% per attempt. */
    const float delta_cot = 1.5f;

    for (int i = 0; i < N_PROBLEMS; i++) {
        correct[i] = (int)(urand() * N_ANSWERS);
        /* Difficulty: U(0.05, 0.45) base accuracy. */
        p_base[i]   = 0.05f + 0.40f * urand();
        p_reason[i] = sigmoidf(logitf(p_base[i]) + delta_cot);
    }

    /* Sweep N from 1 to MAX_N. */
    int Ns[] = { 1, 2, 4, 8, 16, 32, 64 };

    printf("Test-time compute on a verifiable task (N=%d problems, %d answer choices).\n",
           N_PROBLEMS, N_ANSWERS);
    printf("Per-attempt accuracy: base ∼ U(0.05, 0.45);  reasoning = sigmoid(logit(base) + %.2f).\n\n",
           delta_cot);
    printf("%-4s %-12s %-14s %-14s %-14s %-14s\n",
           "N", "compute×", "base pass@1", "base pass@N", "reason pass@N", "reason maj@N");

    for (size_t ni = 0; ni < sizeof(Ns)/sizeof(Ns[0]); ni++) {
        int N = Ns[ni];

        long any_base = 0, any_reason = 0, majority_reason = 0, pass1 = 0;
        int counts[N_ANSWERS];

        for (int i = 0; i < N_PROBLEMS; i++) {
            int base_first   = sample_attempt(correct[i], p_base[i]);
            if (base_first == correct[i]) pass1++;

            /* Best-of-N for base. */
            int hit_base = 0;
            for (int k = 0; k < N; k++) {
                if (sample_attempt(correct[i], p_base[i]) == correct[i]) { hit_base = 1; break; }
            }
            if (hit_base) any_base++;

            /* Best-of-N + majority@N for reasoning. */
            int hit_reason = 0;
            for (int j = 0; j < N_ANSWERS; j++) counts[j] = 0;
            for (int k = 0; k < N; k++) {
                int a = sample_attempt(correct[i], p_reason[i]);
                counts[a]++;
                if (a == correct[i]) hit_reason = 1;
            }
            if (hit_reason) any_reason++;

            /* Majority vote: pick the most-frequent answer. */
            int best = 0;
            for (int j = 1; j < N_ANSWERS; j++) if (counts[j] > counts[best]) best = j;
            if (best == correct[i]) majority_reason++;
        }

        float p1     = (float)pass1            / N_PROBLEMS;
        float bsN    = (float)any_base         / N_PROBLEMS;
        float rsN    = (float)any_reason       / N_PROBLEMS;
        float rmaj   = (float)majority_reason  / N_PROBLEMS;

        printf("%-4d %-12d %-14.3f %-14.3f %-14.3f %-14.3f\n",
               N, N, p1, bsN, rsN, rmaj);
    }

Output:

N    compute×    base pass@1    base pass@N    reason pass@N  reason maj@N
1    1           0.240          0.256          0.572          0.572
2    2           0.242          0.428          0.810          0.620
4    4           0.254          0.658          0.932          0.696
8    8           0.292          0.792          0.986          0.810
16   16          0.276          0.944          0.998          0.858
32   32          0.272          0.994          1.000          0.896
64   64          0.230          0.998          1.000          0.936

Read the columns:

This is the test-time-compute scaling law made concrete. The gap between pass@N and majority@N is the gap between “can verify” and “can’t verify” — math and code can verify, so they hit the ceiling fast. Subjective tasks can only consensus-vote, so they plateau lower.

— think, then check —

The form: at compute budget B (tokens of thinking + parallel samples), the accuracy grows logarithmically with B until it hits a ceiling determined by the model’s intrinsic capability on the task.

E (the floor):

The baseline pass@1 — what the model gets with B = 1, no thinking, single sample. For a 7B model on AIME: ~10%. For a 32B reasoning model: ~50%. The floor depends on how much the model has internalised the relevant capability during training. It’s NOT zero because the model has some knowledge.

A (the slope):

How much each doubling of B buys you in accuracy. Empirically ~5-8 percentage points per doubling for verifier-friendly tasks. The slope depends on:

  • How peaked the model’s distribution is (if it’s already confident, more samples don’t explore much).
  • Whether the verifier is hard-pass (math: exact answer match) or soft (subjective).
  • The diversity of generation (temperature, prompt variants, CoT length).

Higher A = more “explorable” task. Math has very high A (lots of attempts get diverse partial-progress); pure recall tasks have low A (the model either knows it or doesn’t).

Where the curve breaks:

Three failure modes break the log-linear scaling:

  1. Capability ceiling. Some problems are genuinely beyond the model’s reasoning ability. No matter how many samples, none of them solves it. The curve saturates below 100%.
  2. Mode collapse. If the model has been heavily RL-trained, its sampling distribution narrows. Diversity is lost; more samples produce the same wrong answer.
  3. Verifier limitations. Real verifiers have bugs — math graders accept “almost right” answers; code unit tests miss edge cases. The apparent accuracy from the verifier exceeds the true accuracy.

Empirically, models like o1 / R1 reach ~90-95% on competition math by B = 100x; pushing to 1000x buys only a few more percentage points because the remaining problems hit the capability ceiling.

Why this changes the economics

For deployed reasoning models in 2026:

This is a fundamentally new business model for LLM APIs. The same model serves three price points by varying the inference budget. The architecture didn’t change; what changed is HOW you spend compute at inference.

— think, then check —

(a) Marketing email:

Does NOT pay off. No verifier exists for “is this marketing email good.” The model can spend 10× compute writing 5 variants, but without a way to check which one is better, you’ve just paid more for the same expected quality (you’d have to read all 5 anyway).

Better strategy: just use standard tier; iterate on prompts manually.

(b) Calculus problem:

DOES pay off. The answer is verifiable (the numeric result either matches or doesn’t). 10× compute means the model can attempt the problem 10 different ways, or think through 10× longer chain-of-thought. The verifier eliminates wrong attempts; the right answer rises to the top.

Math accuracy at the reasoning tier is typically 30-50 percentage points higher than at the standard tier. Worth every penny if the problem is hard.

(c) Debugging a 500-line script:

Probably pays off, with caveats. There’s a partial verifier (the script runs or it doesn’t; tests pass or fail). Reasoning tier can: (i) try multiple fix candidates, (ii) trace the execution mentally, (iii) maintain longer working context.

The catch: if the bug is in test infrastructure or “the spec is wrong” territory, the verifier signals are misleading. The reasoning model may confidently “fix” something that wasn’t broken. Net positive but less reliable than math.

The rule:

Reasoning-tier compute pays off when (a) the answer is checkable, (b) the model’s capability ceiling on this problem class is well below 100% at the standard tier, and (c) the cost of getting the WRONG answer is much higher than the marginal compute cost.

For subjective work (writing, judgment, recommendation), reasoning tier is mostly a waste. For computational work with verifiers, it’s the obvious choice.

What the production reasoning models actually do

A rough survey of what’s shipping in mid-2026:

Model Mode Budget Cost vs std Use OpenAI o3 "high" 100s of CoT ~30× default for math/science OpenAI o3-pro best-of-N + tree 1000s × N ~200× research, hard problems DeepSeek R1 standard ~200 CoT tokens ~5× open-weight, 8B-class Claude 3.7 (thinking) extended thinking tunable 1-128K ~5-50× tunable budget per query Gemini 2.5 Pro Deep tool-using search multiple rounds ~100× research with web search Common substrate: - Base model: ≥ 30B params (smaller models don't have enough capability ceiling). - SFT on long CoT data (next section). - RL on verifiable rewards (next section). - Inference: parallel best-of-N + sequential CoT + occasional tool use.
— think, then check —

The structural difference:

For verifier-friendly tasks, you can ASYMMETRICALLY filter: cheaply check N candidates, keep the verified-correct one. The cost of one more attempt is small; the value of one correct attempt is large. Best-of-N converges to nearly-100% accuracy when even one of N attempts succeeds.

For non-verifier tasks, there’s no asymmetry. The N candidates are all “plausible essays” or “plausible judgments” — none of them is verifiably correct or wrong. To pick the best, you either:

  • Have a human compare them (expensive, doesn’t scale).
  • Use an LLM judge (the judge has the same biases as the writer — circular).
  • Use a learned reward model (only as good as the data it was trained on, often biased).

None of these give you the same convergence properties.

The math:

For verifier-friendly: P(at least one of N correct) = 1 − (1 − p)^N → 1 as N → ∞. Geometric convergence; even small p (e.g., 10% per attempt) reaches 99% accuracy in ~44 attempts.

For non-verifier with majority voting: P(majority correct) ≈ probability that the model’s most common output IS correct. If the model is confidently wrong (gives the same wrong answer consistently), majority voting INCREASES errors with N. The curve can be inverted.

The implication for product:

Reasoning models EXCEL at: math, formal logic, code with tests, structured optimisation, factual lookup with web search verification, formal verification (Lean, Coq).

Reasoning models ARE NOT BETTER at: chat, summarisation, creative writing, judgment calls, taste, social reasoning, anything without a ground truth.

Mid-2026 reality: o3 and Claude Extended Thinking are SOTA on math/coding benchmarks but barely beat gpt-4o on EQbench (emotional intelligence) or MMLU-Pro’s subjective subsets. The gap is structural, not a temporary engineering deficit.

What might change this:

Better learned verifiers for subjective tasks — “process reward models” that score reasoning STEPS rather than outcomes. Lightman 2023 (Let’s Verify Step by Step) showed this works for math; extending to “is this essay good” is open research.

For now: reasoning models are a tool for verifier-friendly tasks. Non-verifier tasks stay on the standard inference path.

Next: §20.2 — How reasoning models are actually trained. The DeepSeek R1 recipe in detail: cold-start SFT → GRPO with rule-based rewards → rejection sampling → distillation. Why GRPO (Ch.18 §3) was the right RL algorithm for this. And why R1-Zero — pure RL from base, no SFT — worked but produced quirky outputs.