What works, what doesn't, and the open seams

Section 20.3

Eighteen months in, the picture is clearer. Reasoning models are SOTA on math, competition coding, formal theorem proving — domains where verifiers exist and reward signals are clean. They are NOT clearly better on subjective tasks (writing, judgment, social reasoning), and they introduce new failure modes that didn’t exist with standard LLMs (over-thinking simple questions, “reasoning hallucination”, expensive failures on under-specified problems). This section maps the real terrain — what’s genuinely better, what’s a wash, what got worse — and the open research seams worth tracking.

Where reasoning models genuinely win

Benchmark gains (reasoning model vs equivalent-size standard model, mid-2026):

Benchmark                            o3       gpt-4o   Gain
AIME 2024 (math competition)         96.7%    13.4%    +83 pts
USAMO (math olympiad)                79.0%    7.0%     +72 pts
LiveCodeBench (coding)               90.4%    32.1%    +58 pts
Codeforces (rating equivalent)       2727     873      +1854 rating
GPQA Diamond (PhD-level science)     87.7%    53.6%    +34 pts
MATH-500                             99.0%    76.6%    +22 pts
ARC-AGI                              87.5%    ~30%     +58 pts
FrontierMath                         25.2%    2.0%     +23 pts

The gaps on competition math and coding are the LARGEST capability gains in benchmark history. For comparison, GPT-3 → GPT-4 gained ~20-30 points on similar benchmarks; o1 → o3 added a gain of similar magnitude again.

These are the headline numbers. They are real. Math, code, and formal reasoning have undergone a step-change in the past eighteen months that has no precedent in the field’s prior decade.

Where reasoning models are a wash (or worse)

Benchmarks where reasoning models DON'T help, or hurt:

Benchmark                              Result
EQbench (emotional intelligence)       o3: 73.1 vs gpt-4o: 76.4 (−3.3)
Translation (FLORES-200)               o3 ≈ gpt-4o (similar BLEU)
Summarisation (XSum)                   o3 ≈ gpt-4o (similar ROUGE)
Creative writing (LMSYS preference)    gpt-4o > o3 for most subjective tasks
MMLU-Pro humanities subsets            o3: ~85% vs gpt-4o: ~83% (+2)
Chat quality (Chatbot Arena Elo)       gpt-4o ≈ o3 (no real win)

For these tasks the reasoning model is roughly the same quality but:

- Slower (3-30 seconds for thinking + sampling).
- More expensive (10-100× per query).
- Sometimes produces awkward output (over-thinks; cites the chain).

For these domains, the standard non-reasoning model is the right tool.

The reasoning vs non-reasoning choice isn’t “always one or always the other”; it’s task-dependent. Production deployment often routes queries: simple/subjective to standard, hard/verifiable to reasoning. ChatGPT’s interface defaulting to a fast model and offering “Think” as an opt-in reflects this.
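The routing idea can be sketched in a few lines. This is a minimal illustration, not a production design: the keyword lists, the tier names "standard" and "reasoning", and the `Route` type are all hypothetical, and a real router would more likely use a small trained classifier than substring matching.

```python
from dataclasses import dataclass

# Hypothetical keyword heuristics, chosen only for illustration.
VERIFIABLE_HINTS = ("prove", "calculate", "compute", "debug", "solve", "derive")
SUBJECTIVE_HINTS = ("rewrite", "summarise", "summarize", "translate", "tone", "draft")


@dataclass
class Route:
    model: str   # placeholder tier name: "standard" or "reasoning"
    reason: str  # short explanation, useful for logging routing decisions


def route_query(query: str) -> Route:
    """Send hard, verifiable queries to the reasoning tier; everything else to standard."""
    text = query.lower()
    # Subjective work has no verifier, so extra thinking mostly adds cost and latency.
    if any(hint in text for hint in SUBJECTIVE_HINTS):
        return Route("standard", "subjective task")
    # Verifiable work is where test-time compute tends to pay off.
    if any(hint in text for hint in VERIFIABLE_HINTS):
        return Route("reasoning", "verifiable task")
    # Default to the cheap, fast tier unless there is a reason to think.
    return Route("standard", "default")


if __name__ == "__main__":
    for q in ("Prove that sqrt(2) is irrational",
              "Translate this email into French",
              "What's your return policy?"):
        print(f"{q!r} -> {route_query(q).model}")
```

Defaulting to the standard tier keeps the common case cheap; the reasoning tier is only paid for when the query gives some evidence that it is hard and checkable.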

New failure modes

Over-thinking (reasoning failure mode): the model expends extensive thinking compute on a question with a trivial answer, often degrading response quality vs a standard model. Asking “what’s 2+2” may trigger a 5000-token CoT exploring number theory before answering “4”; the same can happen on a question whose answer is “Paris.” The RL trained it to “always think hard”, and this generalises poorly to easy questions. Modern reasoning models (o3, Claude 3.7) include budget-controllers that detect simple questions and route them away from the reasoning path.

Reasoning hallucination (reasoning failure mode): the model produces a confident, well-formatted CoT containing a wrong premise or step, so the final answer is wrong but appears authoritative. The chain LOOKS careful: bullet points, sub-conclusions, even a sanity check. But step 3 made an arithmetic error and step 6 cited a non-existent paper, and the final answer is wrong. This is MORE dangerous than standard hallucination because the long chain makes the model’s confidence look justified; users can be misled into trusting the answer because the reasoning “seems careful.” It is particularly dangerous for non-experts evaluating answers in specialised domains.

Other failure modes:

- Sensitive to prompt structure. Reasoning models can over-optimise for the format they were RL'd on. Slightly different prompt format can dramatically degrade performance.
- Long-horizon planning still fails. A 50-step plan with dependencies often has at least one step wrong; the rest cascades. Reasoning doesn't extend reliably past 20-30 dependent steps.
- Tool use is brittle. Reasoning models call tools (search, calculator, code interpreter) but often fail to integrate tool outputs back into their reasoning. Result: tool calls become a checklist rather than useful information.
- Sycophancy in non-verifier domains. Without verifier signal, the model defaults to agreeing with user assertions. RLHF would have caught this; pure RL on math/code didn't.
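The long-horizon point follows from simple compounding. As a rough sketch, assume every step succeeds independently with the same probability (a simplification; real errors are correlated) and take 98% per-step accuracy as a purely hypothetical figure. The function names below are illustrative.

```python
import math


def plan_success_probability(per_step_accuracy: float, steps: int) -> float:
    """Probability that every step of a dependent chain is right, assuming independence."""
    if not 0.0 < per_step_accuracy < 1.0:
        raise ValueError("per_step_accuracy must be strictly between 0 and 1")
    return per_step_accuracy ** steps


def max_reliable_steps(per_step_accuracy: float, target: float = 0.5) -> int:
    """Largest chain length whose end-to-end success probability is still >= target."""
    if not 0.0 < per_step_accuracy < 1.0 or not 0.0 < target < 1.0:
        raise ValueError("both arguments must be strictly between 0 and 1")
    return math.floor(math.log(target) / math.log(per_step_accuracy))


if __name__ == "__main__":
    p = 0.98  # hypothetical per-step accuracy
    print(f"50-step plan succeeds with probability {plan_success_probability(p, 50):.3f}")
    print(f"longest chain with >= 50% success: {max_reliable_steps(p)} steps")
```

Under these assumptions a 50-step plan completes cleanly only about a third of the time, and the chain length at which success drops below a coin flip is in the low thirties, which is the same order of magnitude as the 20-30 step limit described above.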

— think, then check —

Likely causes of a reasoning model underperforming as a customer-support handler:

1. Over-thinking on simple questions. “What’s your return policy?” doesn’t need 8 seconds of thinking; users expect ~1 second response. The reasoning model’s CoT delay degrades perceived quality regardless of answer quality.

2. Reasoning hallucination in policy questions. The model “carefully reasons” through policy decisions that should just look up the company’s documented policy. CoT structure makes wrong answers LOOK more authoritative.

3. Format mismatch. Some reasoning models expose their CoT to users by default. Customer support expects direct answers, not a chain showing “let me think about this…”

4. Cost. 10-100× per-query cost on customer support volume can be prohibitive. The team may have to throttle the model to keep budget under control, which degrades availability.

5. Tone mismatch. Reasoning models can sound cold and analytical. Customer support benefits from warmth and empathy — qualities a math-RL’d model isn’t trained on.

6. No verifier signal for the actual task. Customer support has no clean verifier. The model’s RL training doesn’t help; you’re just paying for slower inference.

What they should do differently:

Use a standard non-reasoning model as the primary handler.
Reserve the reasoning model for cases that genuinely benefit: math/billing calculations, technical troubleshooting with clear right answers.
Route based on query type. Easy questions → standard. Hard technical → reasoning.
Hide the reasoning chain from the customer; just show the answer.
Consider fine-tuning a smaller model on customer-support-specific data instead — likely better fit than a generic reasoning model.

The general lesson: reasoning models are a tool for hard verifiable problems. Deploying them for everything is over-engineering and counterproductive.

↳ §20.3 + observed deployment issues

The inference economics of reasoning models

Cost-per-correct-answer comparison (rough, mid-2026):

Task: AIME-level math problem (avg ~3 minutes of human work)

Model            Cost/query   pass@1   Cost-per-correct
gpt-4o           $0.01        13%      $0.077
o3 (high mode)   $1.20        97%      $1.24
o3-pro           $10.00       99.5%    $10.05

For this task gpt-4o is CHEAPER per correct answer ($0.077 vs $1.24). But 87% of the time gpt-4o is wrong, and you don't know which 13% of its answers are right. o3's near-100% accuracy means you can TRUST the output without checking.

For tasks where checking is expensive (or impossible): pay for reasoning.
For tasks where checking is cheap or you can tolerate errors: pay for cheap.

Task: simple factual question (avg ~10 seconds of human work)

Model            Cost/query   pass@1   Cost-per-correct
gpt-4o           $0.01        85%      $0.012
o3 (high mode)   $1.20        95%      $1.26

Here gpt-4o WINS on cost-per-correct (~100× cheaper). The 10-point gain in accuracy doesn't justify the cost.
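The cost-per-correct column is just cost per query divided by pass@1. A short sketch reproducing the figures above; the function name is illustrative, and the metric implicitly assumes you can tell when an answer is correct (otherwise retrying until correct is not possible).

```python
def cost_per_correct(cost_per_query: float, pass_at_1: float) -> float:
    """Expected spend per correct answer: on average 1 / pass_at_1 queries are needed."""
    if not 0.0 < pass_at_1 <= 1.0:
        raise ValueError("pass_at_1 must be in (0, 1]")
    return cost_per_query / pass_at_1


# (cost per query in dollars, pass@1) taken from the two tables in this section.
AIME_TASK = {
    "gpt-4o": (0.01, 0.13),
    "o3 (high mode)": (1.20, 0.97),
    "o3-pro": (10.00, 0.995),
}
FACTUAL_TASK = {
    "gpt-4o": (0.01, 0.85),
    "o3 (high mode)": (1.20, 0.95),
}

if __name__ == "__main__":
    for task_name, table in (("AIME-level", AIME_TASK), ("simple factual", FACTUAL_TASK)):
        print(task_name)
        for model, (cost, accuracy) in table.items():
            print(f"  {model:15s} ${cost_per_correct(cost, accuracy):.3f} per correct answer")
```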

— think, then check —

Tax-form preparation profile:

Task is computational (numbers, rules, look-up).
Errors have HIGH cost — wrong tax form → user audited → product reputation destroyed.
Verifier exists partially: arithmetic can be checked; rule application can be cross-referenced with tax code.
Per-form margin is probably $5-50 depending on tier.
User-facing latency: ~30 seconds is acceptable for a tax preparation tool.

Reasoning model fit:

Strong fit. The combination of (a) verifiable computations, (b) catastrophic cost of errors, (c) tolerable latency, (d) high margin all favor reasoning.

Cost-per-correct math:

Standard model (gpt-4o): $0.10/form ÷ ~80% accuracy ≈ $0.13 per correct form.
Reasoning model (o3): $2.00/form ÷ ~98% accuracy ≈ $2.04 per correct form.

Raw cost-per-correct favors standard. But:

Errors in tax filing CAUSE $1000s in penalties (user-perceived cost).
Each error damages brand trust and triggers customer support cost.
The marginal ~$2 per form to cut the error rate roughly tenfold (from ~20% to ~2%) is overwhelmingly worth it for a product where errors are catastrophic.
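Folding the downstream cost of an error into the comparison makes the trade-off explicit. The sketch below uses the per-form costs and accuracies quoted above; the $1,000 penalty is a hypothetical stand-in for "penalties in the $1000s", and both function names are illustrative.

```python
def expected_cost_per_form(model_cost: float, accuracy: float, error_penalty: float) -> float:
    """Model spend plus the expected downstream cost of a wrong form."""
    return model_cost + (1.0 - accuracy) * error_penalty


def break_even_penalty(cheap_cost: float, cheap_acc: float,
                       strong_cost: float, strong_acc: float) -> float:
    """Error penalty above which the more accurate (pricier) model is cheaper overall."""
    if strong_acc <= cheap_acc:
        raise ValueError("the stronger model must be more accurate")
    return (strong_cost - cheap_cost) / (strong_acc - cheap_acc)


if __name__ == "__main__":
    penalty = 1000.0  # hypothetical cost of one wrong filing, in dollars
    standard = expected_cost_per_form(0.10, 0.80, penalty)
    reasoning = expected_cost_per_form(2.00, 0.98, penalty)
    print(f"standard:  ${standard:.2f} expected per form")
    print(f"reasoning: ${reasoning:.2f} expected per form")
    print(f"break-even penalty: ${break_even_penalty(0.10, 0.80, 2.00, 0.98):.2f}")
```

With these inputs the break-even penalty is on the order of ten dollars per error, so any realistic cost of a wrong tax filing puts the reasoning model ahead despite its worse raw cost-per-correct.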

Recommended architecture:

Standard model for routine extraction (most fields are clear).
Reasoning model for “should this deduction apply?” judgment calls and arithmetic verification.
Reasoning model for the FINAL CHECK: re-derive the totals and flag any inconsistency.
Human review for any flagged inconsistency or edge case.

This hybrid uses the cheaper standard model for the 80% of work that’s mechanical, and the expensive reasoning model where it matters most.

Cost: ~$0.50/form. Effective accuracy: ~99%+. Catastrophic-error rate: very low. Margin: viable.

↳ §20.3 + production cost modeling

Open research seams

Where the genuine open problems are (mid-2026):

1. Verifiers for subjective domains. How do you train reasoning models on "is this essay good?" or "is this product recommendation appropriate?" Process reward models help but are limited by labeling cost. Self-evaluation by the model is unreliable. Lightman 2023 → Hosseini 2024 → ???

2. Transfer from verifiable to non-verifiable. Does RL on math/code make the model better at general reasoning? Empirically: some transfer (~5-10 percentage point lift on GPQA), not full transfer. Understanding the mechanism is open research.

3. Inference-time test-time compute scaling beyond pass@N. Best-of-N saturates at ~99%. Tree-of-thoughts, Monte Carlo tree search over reasoning steps, self-consistency variants: all explored, none dominantly better. Open seam.

4. Reasoning model alignment. The standard alignment techniques (RLHF, DPO) are about chat quality. How do you align a reasoning model? Reasoning training specifically adds capability without much safety training. Some models have shown problematic reasoning chains that the final answer hides.

5. Reasoning at scale. Most reasoning model work has been on math/code. Can it scale to multi-step agentic tasks (planning a research project, writing a paper end-to-end)? Currently limited to ~10-20 step reasoning; breaking past this is the agent reliability problem (Ch.27 §3).

6. Cost reduction for reasoning inference. Current reasoning is 10-100× more expensive than standard. Can we get the quality at lower cost? Distillation, smaller-model reasoning, shared CoT prefix caching all help. Open seam.

— think, then check —

Most impactful AND accessible to a small team:

(3) Inference-time test-time compute scaling beyond pass@N.

This is doable on rented compute. Take an open reasoning model (R1, Qwen 3.5 thinking), implement different inference strategies (tree search, self-consistency variants, learned aggregators over multiple CoTs), measure scaling curves. The compute requirement is modest — you’re not training, just inferring.
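As a starting point for that kind of experiment, here is a minimal sketch of one such strategy, self-consistency (majority vote over N sampled answers), with a crude scaling measurement. The sampler is a stub standing in for a real model call, and its 40% accuracy and list of wrong answers are hypothetical values chosen only to make the script runnable.

```python
import random
from collections import Counter
from typing import Callable


def self_consistency(sample_answer: Callable[[], str], n: int) -> str:
    """Draw n independent answers and return the most frequent one."""
    votes = Counter(sample_answer() for _ in range(n))
    return votes.most_common(1)[0][0]


def make_noisy_sampler(correct: str, p_correct: float,
                       rng: random.Random) -> Callable[[], str]:
    """Stub for a model call: right with probability p_correct, else a random wrong answer."""
    wrong_answers = ["40", "41", "43", "44"]  # hypothetical distractors

    def sample() -> str:
        return correct if rng.random() < p_correct else rng.choice(wrong_answers)

    return sample


def accuracy(n_samples: int, trials: int = 2000,
             p_correct: float = 0.4, seed: int = 0) -> float:
    """Fraction of trials in which the majority vote over n_samples is correct."""
    rng = random.Random(seed)
    sampler = make_noisy_sampler("42", p_correct, rng)
    hits = sum(self_consistency(sampler, n_samples) == "42" for _ in range(trials))
    return hits / trials


if __name__ == "__main__":
    # Crude scaling curve: accuracy as a function of the sampling budget.
    for n in (1, 5, 25):
        print(f"N={n:2d}  accuracy={accuracy(n):.3f}")
```

Swapping `make_noisy_sampler` for a call to an open reasoning model, and `self_consistency` for tree search or a learned aggregator, turns the same loop into the scaling-curve comparison described above. Majority voting helps here only because the wrong answers are spread across several values; if the model's errors concentrate on one wrong answer, the vote amplifies it instead.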

The impact is high because better TTC scaling translates directly to lower cost per correct answer in production. A 2× efficiency improvement on TTC is worth tens of millions to inference providers.

Related: (6) cost reduction for reasoning inference. Same skill set; similarly accessible.

Most impactful but frontier-scale only:

(1) Verifiers for subjective domains.

This requires both massive labeling investment (PRMs need expert human labels for thousands of complex tasks) and large-scale RL training to actually evaluate whether the verifier improves model behavior. Frontier labs are the only ones with both the labeling budget and the compute.

If solved, this UNLOCKS reasoning model improvements for subjective tasks (writing, judgment, social reasoning) — which is most of what humans value in LLMs. This is the most impactful seam by far, but it’s not accessible to a small team.

Adjacent: (4) reasoning model alignment. Frontier labs are best positioned here too. Requires understanding the model deeply, having broad red-teaming infrastructure, and the ability to iterate on alignment techniques at scale.

For a small team, focus on:

TTC scaling research (problem 3).
Inference cost reduction (problem 6).
Open evaluation benchmarks for reasoning (related to problem 2).
Reasoning model applications in specific domains (medicine, law) where domain knowledge is the bottleneck.
Tool integration for reasoning models — making them better at incorporating tool outputs back into their reasoning.

Avoid trying to compete with frontier labs on:

Training larger reasoning models from scratch.
Building reward models at scale.
Alignment for reasoning models broadly.

These need frontier infrastructure. Better to focus where small-team productivity is competitive.

↳ §20.3 + 2026 research landscape

The bigger trajectory

This is the most significant economic shift in LLM deployment since the original transformer release. The book’s pretraining chapter (Ch.16) was written from the Chinchilla perspective: compute-optimal training, fixed inference cost. That perspective remains correct for standard models. For reasoning models, the economics flip: per-query cost varies 100×, training matters less, inference matters more.

For a senior engineer transitioning into the field in 2026: this is the chapter to internalise. Reasoning models are the new substrate of premium LLM applications. Understanding HOW they’re trained (verifiable rewards), WHY they work (test-time compute scaling), and WHERE they fail (subjective tasks, over-thinking) is the foundation for shipping anything more sophisticated than chat in 2026 and beyond.

END OF CH.20 — Reasoning models.
§1 (test-time compute scaling: log-linear in B, verifier vs no-verifier asymmetry) · §2 (the DeepSeek R1 recipe: SFT → GRPO with verifiable rewards → distillation; R1-Zero emergence) · §3 (what works, what doesn’t, the open seams).

Next: Ch.21 — Beyond transformers. SSMs, Mamba, hybrid architectures, honest assessment of where alternative architectures stand. The book’s other “what comes next” chapter.