Open problems + the close

Section 27.3

This is the last section of the book. We’ve covered the math (Parts I-II), the architecture (Part III), the LLM lifecycle (Part IV), the systems (Part V), and the alternatives (Ch.20, Ch.26). What’s left is to step back and ask: where do things go from here? What are the open problems worth working on? And — circling back to the conversation that started this book — what should a senior systems engineer arriving in 2026 take from all this? This section is the candid answer.

Where the genuine open problems are (2026)

After three years of frontier model releases (GPT-4, GPT-5, Claude 3/4, Gemini 1.5/2, Llama 3/4, DeepSeek V3/R1), the field has matured. Many problems that seemed open in 2023 are now well understood. Others remain genuinely hard.

Genuine open problems (2026):

1. ALIGNMENT AT SCALE. Current alignment techniques (RLHF, DPO, Constitutional AI) work on 70B-class models. As models grow toward 1T+ parameters and acquire increasingly general capabilities, these techniques may not keep pace. Open: how do you align a model that's smarter than you in many dimensions?

2. LONG-CONTEXT REASONING. Models can NOW technically handle 1M-token contexts. But REASONING over 1M tokens (multi-hop, requiring all the information to be integrated) still fails. The KV cache holds everything, but the model doesn't always USE it. Open: how to ensure the model genuinely reasons over the full context rather than just retrieving snippets.

3. ON-DEVICE FRONTIER INFERENCE. Apple Silicon and similar hardware can run 7-13B models. Frontier (200B+) models require server hardware. Open: how to compress or distill frontier models to run on consumer hardware without losing capability. Candidate directions: targeted distillation, hybrid approaches, on-device adapters.

4. TRAINING EFFICIENCY. Pre-training a 70B-class model still costs $5-50M. Most of the training time is bottlenecked by communication and memory bandwidth, not by arithmetic throughput. Open: better algorithms that match the hardware's bandwidth-to-compute ratio. Candidate directions: ZeRO++, custom optimisers, better mixed-precision recipes.

5. MECHANISTIC INTERPRETABILITY. We can train massive models but barely understand what they're doing internally. Sparse autoencoders, feature direction analysis, and circuit identification are all making slow progress. Open: how to scale interpretability to frontier models. Anthropic's work is the most advanced, and it is still many years from comprehensive understanding.

6. AGENT RELIABILITY. Multi-step agents (Claude Code, OpenAI's Operator, and similar) work well for simple tasks but fail at complex multi-step ones. Failure modes include hallucinated tool calls, context drift, and off-task wandering. Open: how to make agents reliable enough for production business processes.

7. EVALUATION. Benchmarks are increasingly gamed. MMLU is saturated; LiveBench and similar benchmarks are slowly slipping too. We don't have reliable ways to measure "is this model better at <task> than the previous one?" Open: how to evaluate models in ways that resist contamination and capture real capability.

8. SAFETY AT DEPLOYMENT. Models can be tricked, jailbroken, and used for harmful purposes. Defense techniques (Constitutional AI, safety classifiers) help but don't solve the underlying problem. Open: deeper alignment that's robust to adversarial users.

9. NEW ARCHITECTURES THAT ACTUALLY SHIP. SSMs, hybrids, and novel attention variants all show promise. None has actually displaced attention at the frontier. Open: a credible non-transformer that wins the frontier in commercial deployment.

10. MULTIMODALITY DONE RIGHT. Vision-language models work but feel "bolted on" to text models. Audio, video, and embodied models are even further behind. Open: a unified architecture that handles modalities natively, not as text plus encoder.
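Problems 2 and 3 both come down to memory arithmetic, which a few lines can make concrete. The sketch below is a rough estimate only: the 80-layer, 8-KV-head, head-dimension-128 configuration is a hypothetical 70B-class shape chosen for illustration (not taken from any model card), and the figures ignore activations, runtime overhead, and the scale metadata that real quantization formats carry.

```python
GIB = 1024 ** 3

def weight_bytes(n_params: float, bits_per_weight: float) -> float:
    """Bytes needed to store the weights alone at a given precision."""
    return n_params * bits_per_weight / 8

def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2) -> int:
    """Bytes of KV cache for one sequence.

    The leading 2 counts one K tensor and one V tensor per layer.
    bytes_per_elem=2 assumes 16-bit cache entries.
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# A 7B model at an idealised 4 bits per weight: about 3.3 GiB of weights.
small = weight_bytes(7e9, 4)

# Hypothetical 70B-class shape at 16-bit weights: about 130 GiB of weights.
large = weight_bytes(70e9, 16)

# KV cache for ONE 1M-token sequence on that hypothetical shape:
# about 305 GiB, i.e. larger than the weights themselves.
cache = kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128,
                       seq_len=1_000_000)

for name, nbytes in [("7B @ 4-bit", small), ("70B @ 16-bit", large),
                     ("KV cache @ 1M tokens", cache)]:
    print(f"{name:>22}: {nbytes / GIB:8.1f} GiB")
```

Under these assumptions the cache for a single million-token sequence outweighs the model, which is one reason "technically handles 1M tokens" and "runs on consumer hardware" remain separate problems.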

— think, then check —

Realistic for a small team (no large GPU cluster):

1. Evaluation methodology (#7). Building better benchmarks, evaluation frameworks, or contamination-resistant tests doesn’t require huge compute. Can be highly impactful and well-suited to a small focused team.

2. Agent reliability (#6). Mostly about prompt engineering, tool design, error handling. A 5-person team can build agents and study failure modes systematically. Lots of unsolved problems.

3. On-device inference (#3). Quantization, distillation, kernel optimization. Can be done with consumer hardware. Real-world utility (run on phones, laptops). High demand.

4. Long-context reasoning research (#2 partial). If you use APIs (not training), you can study how existing frontier models handle long contexts. Build evals, probe failure modes, propose fixes via prompting/scaffolding.

5. Interpretability (#5 partial). Mechanistic interp on smaller models (7B-class) is accessible. Sparse autoencoders, circuit discovery, etc. Anthropic’s work is the leader; a small team can extend it.

6. Multimodality applications (#10 partial). Building specialized vision-language or audio applications doesn’t require frontier-scale models. Lots of unsolved domain-specific problems.

NOT realistic for a small team:

1. Alignment at scale (#1). Requires training and red-teaming frontier-class models. Resource-intensive.

2. Training efficiency (#4). Requires running large training experiments. Possible to study small-scale and extrapolate, but limited impact.

3. Safety at deployment (#8). Requires access to deployed model traffic, large-scale evaluation infrastructure.

4. New architectures (#9). Without compute to train at scale, hard to show your architecture wins.

Recommended focus:

For a 5-person team without a cluster, specialise in one application or tooling area:

Build a specialized agent framework for a vertical (legal research, medical info, code review).
Build evaluation tools for a specific capability that the field doesn’t have good metrics for.
Build a high-quality on-device deployment for a specific use case.
Contribute to open-source interpretability tools (sparse autoencoders for 7B models).

The pattern: pick problems where compute isn’t the bottleneck — your team’s reasoning, engineering taste, and product judgment are. The “infrastructure-heavy” problems are best left to teams with infrastructure.

↳ §27.3 + 2026 research landscape

What to work on, by career stage

If you're a NEW ML engineer (0-3 years):

- Learn one frontier model deeply. Read all the papers. Run the code.
- Build something useful (an application, a tool, a benchmark).
- Specialise: pick one area (inference, RAG, alignment, etc.).
- Contribute to open source. Maintain a project that others use.
- Join a smaller team where you can have outsized impact.

If you're a MID-CAREER engineer (3-10 years):

- Pivot. Your existing skills (systems, distributed computing, math) are a foundation. Add specifics: CUDA, transformer architecture, scaling laws.
- Choose between: research path (frontier lab), product path (build LLM products), or systems path (inference/training infra).
- Pick a meaningful real-world problem to work on. Build deep expertise.
- Network with researchers. Read papers actively. Develop opinions.

If you're a SENIOR engineer (10+ years):

- Your judgment is the asset. Apply it to the open problems.
- Lead a team or initiative; mentor newer hires.
- Build domain expertise. The intersection of "ML + your specific domain knowledge" is where leverage is.
- Frontier labs are competitive for senior hires but rewards are correspondingly high.

If you're a PHD STUDENT:

- Frontier labs have research teams hiring PhDs constantly.
- Publish, especially in well-targeted venues.
- Build a track record of REAL contributions, not just papers.
- Open-source maintenance is highly valued.

If you're TRANSITIONING from another field:

- The math and systems foundations transfer; the specifics don't.
- Plan 12-18 months of focused learning to get to "professionally useful."
- Build something. Don't just read; ship code that someone can use.
- Specialise rather than trying to know everything.

The close — what this book was trying to do

This book started with a conversation: an experienced systems engineer (“math 20 years ago”) wanted to come up to speed on modern ML — specifically the systems half — without going through a 4-year academic program. The premise was that the math and engineering foundations are still the same (linear algebra, calculus, optimisation, SIMD, memory hierarchies). What’s new is the specific application: transformers, attention, MoE, alignment, the LLM systems stack.

What we built up:

Part I-II (Ch.1-10): the math and the build-then-execute foundation. Linear algebra, calculus, probability, SIMD, scalar optimisation, backprop, autograd. The foundation that doesn’t change regardless of architecture.
Part III (Ch.11-14): the transformer fully assembled. Embeddings, attention with FlashAttention, normalisation, residual streams. The architecture that dominates 2024-2026.
Part IV (Ch.15-20): what makes an LLM. The decoder-only stack, pretraining, MoE, alignment, fine-tuning, alternatives (SSMs, hybrids). The lifecycle of a model from training to deployment.
Part V (Ch.21-25): the systems. Hardware substrate, runtimes, inference, training at scale, quantization. The infrastructure that turns the math into something that runs.
Part VI (Ch.26-27): the frontier. Vector search (closing the original conversation), reading research, where the field is going.

The structural insight underneath: modern ML is mostly systems engineering applied to a well-understood mathematical core. The math (gradient descent, attention) hasn’t changed much since 2017. What’s changed is the systems sophistication: distributed training, memory hierarchies, quantization formats, inference engines. A senior systems engineer can become productive in this field much faster than they might think, because the systems half is already familiar.

What you should take from this

Three things, by priority:

1. THE MATH MATTERS. You can't really understand attention without understanding what a dot product is doing in high dimensions. Can't understand quantization without understanding float anatomy. Can't understand FSDP without understanding communication primitives. The book worked these out from first principles. Refer back when needed.

2. THE SYSTEMS HALF IS MOSTLY APPLIED COMPUTER SCIENCE. Memory hierarchies, communication primitives, distributed systems, compiler optimization: none of this is specifically "ML." It's the classical systems toolkit, applied to a specific problem. Your systems expertise transfers.

3. KEEP LEARNING. The field moves fast: every 6 months brings significant changes. The way to stay current isn't to read every paper (impossible) but to find the 10-20 papers per year that genuinely matter and read them deeply. Follow the people whose taste you trust. Build things.

The book is a snapshot of where things were in 2026. The math will still be relevant in 2036. The specific architectures (transformers) may not be. But the way of THINKING about them (the math + systems + skeptical reading) is what travels.

Bringing this all the way back

The book opened with a question: “what’s the intuition behind TurboQuant — and llama.cpp’s tiling — and how do they all connect?”

We’ve now answered: rotation preserves distances (Ch.2 §3); random projections preserve geometry (Ch.7); attention is the matmul-softmax-matmul pattern (Ch.13); FlashAttention tiles + online softmax to keep operations in SRAM (Ch.13 §3); quantization compresses weights (Ch.25); HNSW + PQ + RaBitQ apply similar compression principles to vector search (Ch.26).
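The "tiles + online softmax" step is the one that most rewards seeing in code. What follows is a minimal sketch for a single query vector in pure Python, not the FlashAttention kernel: the function names and the tile size are illustrative assumptions, and there is no SRAM management, batching, or masking. It shows only the bookkeeping (a running maximum, a running denominator, and a rescaled accumulator) that lets attention be computed one tile of keys at a time without ever materialising the full score vector.

```python
import math
import random

def _score(q, k):
    # Scaled dot product between one query and one key.
    return sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q))

def attention_naive(q, keys, values):
    """Reference: softmax over ALL scores, then a weighted sum of values."""
    scores = [_score(q, k) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    d = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / z
            for j in range(d)]

def attention_tiled(q, keys, values, tile=4):
    """Same result, visiting keys/values one tile at a time."""
    d = len(values[0])
    m = -math.inf        # running max of scores seen so far
    z = 0.0              # running softmax denominator, relative to m
    acc = [0.0] * d      # running weighted sum of values, relative to m
    for start in range(0, len(keys), tile):
        scores = [_score(q, k) for k in keys[start:start + tile]]
        m_new = max(m, max(scores))
        # Rescale everything accumulated under the old max.
        # On the first tile this is exp(-inf) == 0.0, which is harmless.
        correction = math.exp(m - m_new)
        z *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, values[start:start + tile]):
            w = math.exp(s - m_new)
            z += w
            for j in range(d):
                acc[j] += w * v[j]
        m = m_new
    return [a / z for a in acc]

# Small demo with random data; 10 keys and tile=4 gives tiles of 4, 4, 2.
rng = random.Random(0)
q = [rng.gauss(0, 1) for _ in range(8)]
keys = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(10)]
values = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(10)]
diff = max(abs(a - b) for a, b in zip(attention_naive(q, keys, values),
                                      attention_tiled(q, keys, values)))
print(f"max abs difference: {diff:.2e}")
```

The two functions agree up to floating-point rounding. The tiled version only ever holds one tile of scores, which is the property that lets a real kernel keep its working set in on-chip memory.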

The connections are real. Rotation-based quantization for vector search (RaBitQ) builds on the distance-preserving rotations of Ch.2, and it makes the SAME bandwidth-vs-compute trade-off as tile-based attention (FlashAttention) and weight quantization (q4_K_M): spend a little extra arithmetic to move fewer bytes. Once you understand the principles, the connections light up across the field.
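The claim that rotation preserves distances, which rotation-based quantizers rely on, can be checked numerically in a few lines. This is a sketch under stated assumptions: the rotation is built as a product of random plane (Givens) rotations, chosen here because it needs no linear-algebra library. It is a stand-in for illustration, not how RaBitQ or any particular library constructs its transform, and the helper names are hypothetical.

```python
import math
import random

def random_plane_rotations(dim, n_planes, rng):
    """A rotation described as a list of (i, j, theta) plane rotations."""
    planes = []
    for _ in range(n_planes):
        i, j = rng.sample(range(dim), 2)   # two distinct coordinates
        planes.append((i, j, rng.uniform(0.0, 2.0 * math.pi)))
    return planes

def rotate(vec, planes):
    """Apply each plane rotation in turn; the composition is orthogonal."""
    out = list(vec)
    for i, j, theta in planes:
        c, s = math.cos(theta), math.sin(theta)
        # Right-hand side is evaluated before assignment, so both
        # updates use the old values of out[i] and out[j].
        out[i], out[j] = c * out[i] - s * out[j], s * out[i] + c * out[j]
    return out

rng = random.Random(0)
dim = 16
planes = random_plane_rotations(dim, n_planes=200, rng=rng)
x = [rng.gauss(0, 1) for _ in range(dim)]
y = [rng.gauss(0, 1) for _ in range(dim)]

before = math.dist(x, y)
after = math.dist(rotate(x, planes), rotate(y, planes))
print(f"distance before: {before:.6f}, after: {after:.6f}")
```

The individual coordinates change completely while the pairwise distance is unchanged up to rounding. That is why a quantizer can rotate first, to spread information evenly across coordinates, and still answer nearest-neighbour queries in the rotated space.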

The book is now done.

— think, then check —

Month 1-2: Build foundational projects.

Reimplement a core transformer from scratch. nanoGPT-style: read the Karpathy lectures, build it yourself in PyTorch. Train on a small dataset. Understand every layer.
Implement FlashAttention from the paper. Either as a CUDA kernel or in Triton. Verify against PyTorch’s standard attention.
Reproduce a small published result. E.g., fine-tune Llama 3 8B on a specific task using LoRA. Document the process. Compare your numbers to the paper.

Output: 3 GitHub repos showing technical depth.

Month 3-4: Specialize and build a portfolio piece.

Pick a specialization. Either: inference optimization (vLLM-style), pretraining infrastructure, alignment/RLHF, or applied LLM products.
Build a portfolio piece. Something that demonstrates real competence. Examples: a new GGUF quantization implementation, a novel benchmark for a capability the field doesn’t measure well, a working agent system for a specific task, or an LLM-powered tool that solves a real problem.
Write a blog post about it. Documentation is half the value.

Output: 1 substantial portfolio piece with code + writing.

Month 5: Network and contribute.

Contribute to a high-profile open source project. vLLM, llama.cpp, transformers, lm-eval-harness. Even small PRs build reputation.
Attend a conference. NeurIPS, ICML, MLSys, or smaller workshops. Meet people. Have informed conversations.
Engage on Twitter/X. Follow researchers, share thoughtful work, comment substantively. This is where much of ML hiring intersects with reputation.

Output: a network of 10-20 people who know your work.

Month 6: Job search.

Apply broadly. Don’t aim ONLY for OpenAI/Anthropic. Mid-tier labs (Cohere, Mistral, AI2, MIT-Hopper, smaller groups at Meta/Google/Amazon) have great opportunities and faster interview processes.
Use your portfolio. Reference your projects in cover letters. Make your GitHub the headline of your resume.
Mock interview. ML interviews are different from generic engineering. Practice paper discussions, system design with ML twist, coding-with-PyTorch.
Negotiate. Senior engineers transitioning into ML often UNDERPRICE their experience. ML salaries are high; don’t sell yourself short.

Output: 2-3 offer letters.

The realistic outcome:

If executed well: a strong senior systems engineer can land a productive role at a mid-to-top ML lab within 6-9 months of focused work. The combination of systems expertise + applied ML knowledge is valuable; the supply of people who have both has not yet caught up with demand.

If you don’t have the patience for 6 months of building before applying: start with smaller roles (DevOps for ML teams, applied engineering at an LLM-using company) and grow from there. Network and build reputation; opportunities follow.

The trajectory is real. ML doesn’t require a 4-year CS PhD to be productive; it requires solid foundations + recent specific knowledge + a portfolio of work. This book gave you the first two; the third is on you.

↳ §27.3 + career stage advice

— think, then check —

The central thesis:

Modern ML, despite its mystique, is mostly systems engineering applied to well-understood math. A senior engineer with systems background and willingness to learn the specific math can become productive at the frontier of ML in 6-18 months. The “ML transition” isn’t a 4-year apprenticeship; it’s a focused study of a specific domain.

The supporting arguments:

The math is mostly classical. Linear algebra, calculus, probability, optimisation. None of it has fundamentally changed. The applications to neural nets and transformers are specific but they BUILD on classical foundations.
The systems half is applied computer science. Memory hierarchies, distributed primitives, compilers, quantization, profiling. These are all classical CS topics applied to a new domain.
The architectures are simpler than they look. A transformer is conceptually: embed → (attention + FFN) × L → unembed. Each piece is a few weeks of study at most. The complexity is in scale and tuning, not in fundamental ideas.
The systems engineering is what scales it. FlashAttention, ZeRO, PagedAttention, quantization — these are all systems contributions that turned theoretical models into things that train and serve at scale.
Reading research and staying current is a skill. The field publishes too much; you must learn to filter. Once you can read efficiently, you can stay current with relatively modest investment.
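The "embed → (attention + FFN) × L → unembed" skeleton from the third argument fits in well under a hundred lines. The sketch below is deliberately stripped down and makes several assumptions not stated above: single-head attention, a ReLU FFN, tied embedding/unembedding weights, random untrained parameters, and no normalisation layers or positional encoding. It shows the shape of the computation, not a usable model.

```python
import math
import random

def matvec(W, x):
    # W is a list of rows; returns W @ x.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def add(a, b):
    return [x + y for x, y in zip(a, b)]

def softmax(xs):
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    z = sum(e)
    return [v / z for v in e]

def causal_attention(xs, Wq, Wk, Wv):
    qs = [matvec(Wq, x) for x in xs]
    ks = [matvec(Wk, x) for x in xs]
    vs = [matvec(Wv, x) for x in xs]
    d = len(qs[0])
    out = []
    for t, q in enumerate(qs):
        # Position t attends only to positions 0..t (the causal mask).
        scores = [sum(a * b for a, b in zip(q, ks[s])) / math.sqrt(d)
                  for s in range(t + 1)]
        w = softmax(scores)
        out.append([sum(w[s] * vs[s][j] for s in range(t + 1))
                    for j in range(d)])
    return out

def ffn(x, W1, W2):
    hidden = [max(0.0, v) for v in matvec(W1, x)]   # ReLU
    return matvec(W2, hidden)

def init_params(vocab, d, n_layers, rng, scale=0.1):
    def mat(rows, cols):
        return [[rng.gauss(0, scale) for _ in range(cols)]
                for _ in range(rows)]
    layers = [{"Wq": mat(d, d), "Wk": mat(d, d), "Wv": mat(d, d),
               "W1": mat(4 * d, d), "W2": mat(d, 4 * d)}
              for _ in range(n_layers)]
    return {"embed": mat(vocab, d), "layers": layers}

def forward(tokens, params):
    xs = [params["embed"][t] for t in tokens]             # embed
    for layer in params["layers"]:                        # x L
        attn = causal_attention(xs, layer["Wq"], layer["Wk"], layer["Wv"])
        xs = [add(x, a) for x, a in zip(xs, attn)]        # residual
        xs = [add(x, ffn(x, layer["W1"], layer["W2"])) for x in xs]
    # Unembed with the (tied) embedding matrix: one logit per vocab entry.
    return [matvec(params["embed"], x) for x in xs]

params = init_params(vocab=16, d=8, n_layers=2, rng=random.Random(0))
logits = forward([1, 2, 3], params)
print(len(logits), "positions x", len(logits[0]), "vocab logits")
```

Everything a production model adds on top of this (multiple heads, normalisation, positional information, fused kernels, sharding) is scale and tuning, which is the point the argument makes.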

The one key insight:

YOU ALREADY KNOW MORE THAN YOU THINK.

If you have systems engineering background — debugging memory bugs, optimising distributed systems, working with hardware constraints — you have the foundational toolkit. Transformer architectures are a specific application of these skills, not a separate discipline.

The mystery of ML is largely manufactured. The papers are hard to read because of jargon and notation, not because the ideas are inherently complex. Strip away the formalisms, and most “modern ML” insights are: “we found a slightly better way to organise computation in a setting where memory and compute are imbalanced.”

Once you see this, the field becomes much more accessible. You stop being intimidated by the architecture papers. You start seeing the systems patterns. You can read FlashAttention 3 and recognise that it’s “online softmax + GPU tiling + dataflow optimisation.” You can read Chinchilla and recognise that it’s “scaling-law fitting to find the optimal compute allocation.”
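The "optimal compute allocation" reading of Chinchilla reduces to a short calculation once two widely quoted approximations are accepted: training cost is roughly C ≈ 6·N·D FLOPs for N parameters and D tokens, and the compute-optimal ratio is roughly 20 tokens per parameter. Both are rules of thumb rather than the paper's fitted constants, and the function name below is hypothetical.

```python
import math

def compute_optimal(flops: float, tokens_per_param: float = 20.0):
    """Split a FLOP budget into (parameters, tokens).

    Assumes C = 6 * N * D and D = tokens_per_param * N, so
    C = 6 * tokens_per_param * N**2 and N = sqrt(C / (6 * tokens_per_param)).
    """
    n_params = math.sqrt(flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Budget equal to 6 * 70e9 params * 1.4e12 tokens = 5.88e23 FLOPs.
budget = 6 * 70e9 * 1.4e12
n, d = compute_optimal(budget)
print(f"params: {n:.3g}, tokens: {d:.3g}")
```

Feeding in a budget of 5.88e23 FLOPs recovers roughly 70B parameters and 1.4T tokens, the configuration Chinchilla itself used. The paper's contribution is fitting the constants empirically; the allocation step is this arithmetic.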

The book’s goal was to build you up to this point. The math from first principles. The systems from existing knowledge. The architectures as applications. The current state of the art as a snapshot.

What you do with this knowledge is up to you.

You can build LLM products. You can join a frontier lab. You can write papers. You can contribute to open source. You can teach others.

What you cannot do anymore: claim that ML is some inaccessible art reserved for people with academic ML pedigrees. The field is younger than you think, the entry barriers are lower than they appear, and the demand for people with systems-plus-ML knowledge is very high.

Now go build something.

↳ §27.3 + the full book

END OF CH.27 — Reading research like a researcher.
§1 (how to attack a paper: three-pass methodology, red flags, replicate test) · §2 (frontier lab reality: roles, day-to-day, what papers hide) · §3 (open problems 2026, career stage advice, the closing reflection).

END OF PART VI — The Frontier.

END OF THE BOOK — From Systems to Frontier ML.

27 chapters. 81 sections. ~50 C kernels demonstrating real computation. ~20 interactive Svelte visualisations. The math from first principles up through the systems engineering that runs modern frontier models.

Started with an experienced systems engineer wanting to come up to speed on modern ML. Ended with — hopefully — the foundation to keep going independently. The field will continue to evolve. The math will not.

Thank you for reading. Now go build something.