BEYOND TRANSFORMERS
Section 20.3

Where this stands — an honest assessment

It’s 2025. Mamba is two years old. RWKV is older. Hybrid models have shipped commercially. Has the post-transformer revolution arrived? No, not yet, and probably not in the form people expected. As far as is publicly known, the frontier (GPT-5, Claude 4, Gemini 2) is still pure attention. The mid-range is increasingly hybrid. Pure SSMs occupy a research niche. This section is the candid take: where each architecture wins, where it doesn’t, what the realistic 2-3 year trajectory looks like, and what to actually care about as a practitioner. Less hype, more honest accounting.

Where attention still wins (the frontier)

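The headline: as far as public information goes, every frontier model in 2025 uses attention for sequence mixing. GPT-5 (OpenAI), Claude 4 (Anthropic), Gemini 2 (Google), Grok 3 (xAI), Llama 4 (Meta), DeepSeek V3 (DeepSeek), Qwen 3 (Alibaba) are all documented or widely understood to be transformer-attention based; the closed labs do not publish full architectures. None is known to be Mamba, hybrid, or SSM.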

Why?

  1. Quality at the limit. The 1-2% perplexity gap between pure attention and hybrid Mamba MATTERS when you’re competing on benchmarks. At ~$10M/run training cost, a 1% loss improvement is worth millions in inference economics.
  2. Tooling maturity. Every training stack, every inference engine, every interpretability tool, every fine-tuning library is tuned for attention. SSMs require new tools and have less community knowledge.
  3. Risk aversion. Frontier labs don’t want their model to be “the one that proves SSMs were wrong.” Cost of failure is catastrophic; benefit of being right is marginal.
  4. Diminishing benefit at frontier scale. SSM efficiency wins matter most when compute is the bottleneck. Frontier labs have effectively unlimited compute; their bottleneck is data quality and post-training, not pretraining compute.

Where SSMs / hybrids genuinely win

For everyone except the handful of frontier labs, the picture is different:


Decision tree:

1. What’s the target context length?

  • ≤ 32K: pure attention is fine. KV cache is manageable; FlashAttention efficient. Default to it.
  • 32K-128K: still attention-friendly. GQA + FlashAttention 3 handles this comfortably on H100/B200.
  • 128K-1M: hybrid (Jamba, Samba, Granite 4) becomes very competitive. Attention is feasible but expensive.
  • 1M+: hybrid or pure SSM clearly wins. Pure attention is prohibitively expensive.

2. What’s the deployment target?

  • Datacenter GPU (H100+): both options work. Pick by quality/cost.
  • Edge / mobile: SSM or hybrid. Fixed state dramatically simpler than growing KV cache.
  • Browser / WASM: SSM (or attention with very small context). Memory is the constraint.
  • Multi-tenant SaaS: attention’s KV-cache-per-request can dominate; hybrid is friendlier.

3. What’s the quality requirement?

  • “State of the art”: pure attention. Hybrid is 1-3% behind on most benchmarks.
  • “Production-ready chat / docs”: hybrid is comparable. Quality difference is small.
  • “Specific domain fine-tune”: either works. Domain adaptation usually swamps architecture choice.

4. What’s the training plan?

  • Pretraining yourself: hybrid models exist as open weights (Jamba, Granite 4); start from those checkpoints rather than from random initialisation.
  • Fine-tune existing model: attention (Llama, Qwen) has far more ecosystem support for LoRA/QLoRA. SSM fine-tuning is younger.
  • Use API only: doesn’t matter — call the API.
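The decision tree above can be written down as a small function. This is a minimal sketch, not a definitive rule: the names `Project` and `recommend_architecture` are hypothetical, and the order in which the questions are checked (quality first, then context length, then deployment target) is an assumption, since the text does not rank them.

```python
from dataclasses import dataclass


@dataclass
class Project:
    context_tokens: int        # target context length in tokens
    target: str                # "datacenter", "edge", "browser", or "saas"
    needs_sota: bool = False   # True when chasing state-of-the-art benchmarks


def recommend_architecture(p: Project) -> str:
    """Rough architecture family for a project, following the questions above."""
    # Assumed precedence: the quality requirement overrides everything else.
    if p.needs_sota:
        return "attention"
    # 1M+ context: pure attention is prohibitively expensive.
    if p.context_tokens >= 1_000_000:
        return "hybrid or ssm"
    # Memory-constrained targets favour a fixed-size state.
    if p.target == "browser":
        return "ssm"
    if p.target == "edge":
        return "ssm or hybrid"
    # 128K-1M: hybrids become very competitive.
    if p.context_tokens > 128_000:
        return "hybrid"
    # Multi-tenant serving: per-request KV cache can dominate.
    if p.target == "saas":
        return "hybrid"
    # Everything else: default to the mature ecosystem.
    return "attention"


if __name__ == "__main__":
    print(recommend_architecture(Project(32_000, "datacenter")))
    print(recommend_architecture(Project(1_000_000, "datacenter")))
```

Treat the return value as a starting point for evaluation rather than a verdict; the thresholds are the ones quoted in the list, nothing more precise.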

Concrete recommendations:

  • Building a chat app with 32K context: Llama 3 / Qwen 2.5 (attention). Maturity + tooling wins.
  • Document understanding at 1M+ context: Jamba or Granite 4 (hybrid). Attention infeasible.
  • On-device assistant: Falcon-Mamba 7B or similar SSM. Memory budget matters.
  • State-of-the-art research model: attention (you’re competing at frontier; tooling matters).

The honest version: most projects in 2025 default to attention because the ecosystem is more mature. Hybrid/SSM is for specific use cases where context length or inference cost is the binding constraint.

Why pure SSMs probably won’t dominate

The structural argument against pure SSMs taking over:

The fundamental SSM limitation: FIXED STATE.

Mamba state size: N · D ≈ 16 · 4096 = 64K values per layer per sequence. Total state across 32 layers: ~2M values, which is about 2 MB only at one byte per value (roughly 4 MB at 16-bit precision). This fixed state is supposed to compress the ENTIRE prefix. Measured against T token vectors of width D, the state holds the equivalent of 32 · 16 = 512 vectors, so the compression ratio is roughly T / 512:

  • T = 4K tokens: compression ratio ~8×. Easy.
  • T = 32K tokens: compression ratio ~64×. OK.
  • T = 256K tokens: compression ratio ~512×. Lossy.
  • T = 1M tokens: compression ratio ~2000×. Heavy loss.

At very long context, the state CANNOT hold all relevant information. Recall failures multiply. Attention's KV cache (which grows with T) doesn't have this problem: it stores every token's K and V exactly.

The fundamental attention advantage: EXACT MEMORY.

Attention's KV cache holds 2 · L · H_kv · d_k · T values (L layers, H_kv KV heads of width d_k), times the bytes per value, so it grows linearly with T. At T = 1M: maybe 10-100 GB (depending on KV-head count). Costly in HBM, but the information is THERE, accessible by any query.

Hybrid models get the best of both: cheap linear-time Mamba layers handle bulk computation; a few attention layers preserve precise recall.
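The arithmetic above is easy to check in a few lines. The sketch below takes N = 16, D = 4096 and 32 layers from the text; the KV-head count, head width and 16-bit storage are assumptions chosen for illustration. It reads the compression ratio as "T token vectors of width D, divided by total state size", which is one interpretation that reproduces the 8×, 64×, 512× and ~2000× figures quoted above.

```python
def mamba_state_values(n_state: int = 16, d_model: int = 4096, n_layers: int = 32) -> int:
    """Values held in the recurrent state for one sequence (independent of T)."""
    return n_state * d_model * n_layers


def compression_ratio(tokens: int, n_state: int = 16, d_model: int = 4096,
                      n_layers: int = 32) -> float:
    """Prefix size (T vectors of width d_model) divided by total state size."""
    return tokens * d_model / mamba_state_values(n_state, d_model, n_layers)


def kv_cache_bytes(tokens: int, n_layers: int = 32, n_kv_heads: int = 8,
                   d_head: int = 128, bytes_per_value: int = 2) -> int:
    """KV cache size for one sequence; grows linearly with the token count."""
    # The factor 2 covers one K and one V vector per token, layer and KV head.
    # n_kv_heads, d_head and bytes_per_value are illustrative assumptions.
    return 2 * n_layers * n_kv_heads * d_head * tokens * bytes_per_value


if __name__ == "__main__":
    for tokens in (4_096, 32_768, 262_144, 1_048_576):
        ratio = compression_ratio(tokens)
        gib = kv_cache_bytes(tokens) / 2**30
        print(f"T={tokens:>9,}  state compression ~{ratio:,.0f}x  KV cache ~{gib:,.1f} GiB")
```

With these assumed settings the KV cache at roughly 1M tokens comes to 128 GiB with 8 KV heads and 16 GiB with a single KV head, which is the same ballpark as the range quoted above.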

Why recall-heavy tasks expose the weakness:

The common thread: tasks that require RETRIEVING SPECIFIC INFORMATION from earlier in the context, rather than producing fluent continuation.

Long-context recall (Needle in a Haystack):

Task: 100K tokens of context, a specific fact buried in the middle, query at the end. The model must locate and reproduce the fact.

Why SSM fails: the state’s fixed compression means the specific fact (a few tokens of information) gets averaged with 100K other tokens. The signal is lost in the noise.

Why attention works: every token’s K, V is stored exactly. The query at the end can DIRECTLY ATTEND to the buried fact’s keys, retrieving its value precisely.
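A toy experiment makes the contrast concrete. The sketch below is not Mamba: as a stand-in for a fixed-size state it uses a linear-attention-style outer-product memory S = Σ kᵢvᵢᵀ, with random keys and values and no learned selectivity, so it overstates how badly a trained SSM would do. The function name and all parameters are hypothetical. What it does show is the mechanism: the softmax lookup isolates the needle, while the fixed state returns the needle plus interference from every other stored token.

```python
import math
import random


def unit_vector(rng, dim):
    """Random direction on the unit sphere."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def recall_errors(num_tokens=2000, key_dim=32, value_dim=8, sharpness=50.0, seed=0):
    """Return (attention_error, fixed_state_error) for retrieving one buried value."""
    rng = random.Random(seed)
    keys = [unit_vector(rng, key_dim) for _ in range(num_tokens)]
    values = [[rng.gauss(0.0, 1.0) for _ in range(value_dim)] for _ in range(num_tokens)]
    needle = num_tokens // 2   # the fact sits in the middle of the context
    query = keys[needle]       # assumed: the query matches the needle's key exactly

    # Attention: softmax over every stored key, then a weighted sum of stored values.
    scores = [sharpness * dot(query, k) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    attn_out = [sum(w * v[j] for w, v in zip(weights, values)) / total
                for j in range(value_dim)]

    # Fixed state: accumulate k v^T into a key_dim x value_dim matrix, read out q^T S.
    state = [[0.0] * value_dim for _ in range(key_dim)]
    for k, v in zip(keys, values):
        for a in range(key_dim):
            for b in range(value_dim):
                state[a][b] += k[a] * v[b]
    state_out = [sum(query[a] * state[a][b] for a in range(key_dim))
                 for b in range(value_dim)]

    def error(out):
        # Euclidean distance between the retrieved vector and the true needle value.
        return math.sqrt(sum((o - t) ** 2 for o, t in zip(out, values[needle])))

    return error(attn_out), error(state_out)


if __name__ == "__main__":
    attn_err, state_err = recall_errors()
    print(f"attention error: {attn_err:.6f}   fixed-state error: {state_err:.3f}")
```

The state here has 32 · 8 = 256 numbers regardless of how many tokens are written into it, which is the point: the interference term keeps growing as more tokens are stored, while the attention lookup still has every key available.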

Multi-hop reasoning:

Task: requires retrieving multiple distinct pieces of information and combining them.

Why SSM fails: each retrieval pass through the state is lossy; combining multiple retrievals compounds the loss. By hop 3-4, the information is too degraded.

Why attention works: each “hop” can be a fresh attention lookup with high fidelity. Multi-hop reasoning compounds reliably.

Code generation:

Task: write a function that uses utilities defined earlier in the file (or in the context). The model must reference specific symbols by exact name.

Why SSM fails: symbol names are arbitrary tokens with no semantic shorthand. The state can’t compress “function called calculate_metric” into anything shorter without losing the exact name. When asked to generate code that USES it, the recall is fuzzy.

Why attention works: tokens are stored exactly in the KV cache. When generating a call to the function, the model attends back to its definition and gets the exact name.

How hybrids close the gap:

Hybrid designs rest on a key insight: precise recall is needed ONLY occasionally. The Mamba and FFN layers do the bulk processing; the few attention layers handle the recall-heavy lookups.

For Needle in a Haystack: even one attention layer with access to the full KV cache can find the needle.

For multi-hop reasoning: a few attention layers can handle each hop precisely; the Mamba layers in between can do the “reasoning” computation cheaply.

For code generation: attention layers retrieve symbol names exactly; Mamba layers compose the code.

Empirically, and with wide variation across benchmarks and model sizes: hybrid Mamba+attention models recover roughly 80-95% of attention-only quality on these specific tasks, at substantially lower compute. Pure Mamba lands around 50-70%, which is usable but noticeably worse.

The “few attention layers” approach is the pragmatic resolution to a fundamental architectural trade-off. It’s likely the long-term direction.
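To see what the “few attention layers” pattern buys in memory terms, here is a rough per-sequence estimate. It is a sketch under stated assumptions: the one-attention-layer-in-eight ratio, the KV-head count, head width and 16-bit storage are illustrative choices rather than any specific model's configuration, and `hybrid_memory_bytes` is a hypothetical helper. N = 16, D = 4096 and 32 layers follow the figures used earlier in this section.

```python
def hybrid_memory_bytes(tokens, n_layers=32, attention_every=8,
                        n_kv_heads=8, d_head=128,
                        n_state=16, d_model=4096, bytes_per_value=2):
    """Per-sequence inference memory for a hybrid stack: (kv_bytes, state_bytes)."""
    n_attention = n_layers // attention_every   # layers that keep a growing KV cache
    n_mamba = n_layers - n_attention            # layers that keep a fixed-size state
    # K and V for every token, but only in the attention layers.
    kv_bytes = 2 * n_attention * n_kv_heads * d_head * tokens * bytes_per_value
    # Recurrent state: constant in the number of tokens.
    state_bytes = n_mamba * n_state * d_model * bytes_per_value
    return kv_bytes, state_bytes


if __name__ == "__main__":
    tokens = 1_048_576
    for label, every in (("pure attention", 1), ("hybrid, 1 in 8", 8)):
        kv, state = hybrid_memory_bytes(tokens, attention_every=every)
        print(f"{label:>15}: KV {kv / 2**30:6.1f} GiB, state {state / 2**20:4.1f} MiB")
```

Under these assumptions the hybrid keeps exact K and V for every token in its four attention layers, so recall is still possible, while the KV cache shrinks by the same factor of eight as the attention-layer count. The Mamba state adds only a few megabytes.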

The 2-3 year trajectory

Predictions are hard, but here’s the calibrated guess:

2025: Frontier = pure attention. Hybrid takes 20-30% of mid-tier.
2026: Frontier = pure attention OR hybrid-experimental. Hybrid takes 40-50% of mid-tier.
2027: Frontier likely transitions to hybrid (or pure attention with major efficiency improvements). Pure SSMs remain a research / specialised niche.

What's likely:

  • Attention models will adopt SSM-inspired techniques (linear attention variants, sliding window + global attention, state compression in long contexts).
  • The frontier will be "attention-flavored" but with O(N) layers mixed in.
  • The pure-SSM bet (Mamba 4, RWKV 8) will mature as research artifacts but may not displace attention at the frontier.

What's possible but uncertain:

  • A "BitNet-like" surprise where a pure-SSM model at small scale shows clear quality advantages that justify investment in the architecture for frontier scale.
  • Hardware specialisation for SSMs (TPUs, custom accelerators) that changes the compute economics dramatically.

What's unlikely:

  • Transformers being fully replaced in 2-3 years.
  • Mamba becoming the default in any new commercial frontier model in 2025-2026.

Base prediction (most likely):

Frontier 2027 models will be predominantly attention-based but with significant SSM-inspired elements. The specific incarnation:

  • Most layers: attention with sliding-window or global+local mix, plus FlashAttention 4+ efficiency.
  • Some layers: pure Mamba or RWKV-style linear-time layers, used for long-distance integration.
  • Sublayer-level mixing: even within a transformer block, some heads might use attention and others Mamba.
  • Context length: 1-10M tokens routine. Some 100M-token experiments.

This is the “hybrid path” — attention dominant but with SSM-style components mixed in for efficiency.

Alternative path: full hybrid takeover.

If Jamba 4 or similar hits a breakthrough on quality, the frontier might shift to Mamba-dominant + attention layers. Probability: 30%.

Alternative path: attention reigns.

If hardware improvements (B200 → R100 → …) make attention’s O(N²) acceptable at 10M+ context, the frontier might never adopt SSMs. Probability: 25%.

Alternative path: something else.

A genuinely new architecture appears (e.g., a diffusion-based language model from research, or a successor to both attention and SSMs). Probability: 15%. Three years of “transformers will be replaced” predictions haven’t delivered this yet, but it’s possible.

What could change the prediction:

  • A pure-SSM frontier model proving its quality advantage at scale (e.g., DeepSeek or Anthropic shipping a pure-Mamba 70B that matches GPT-5).
  • A breakthrough in attention efficiency (FlashAttention 5+ with sub-quadratic scaling in practice).
  • Custom silicon for SSMs (Groq-like LPUs optimised for recurrent computation).
  • A “language modeling at 1M+ context” task that becomes economically critical and forces the architecture shift.

The honest investor advice:

Bet on hybrid architectures. They’re the path of least resistance — incremental adoption from existing attention models, with concrete cost wins. Pure SSM bets are higher-variance: bigger upside if they work, but they’ve underperformed expectations for two years now.

The honest practitioner advice:

Default to attention models (Llama, Qwen, etc.) for any new project. Switch to hybrid (Jamba, Granite 4) ONLY when long context or inference cost is the binding constraint. Don’t bet on pure SSMs in production until they prove themselves at scale.

END OF CH.20 — Beyond Transformers.
§1 (SSMs and Mamba: O(N) recurrence + input-dependent selectivity) · §2 (hybrid architectures: Jamba, Zamba, Samba, RWKV — the “few attention layers” pattern) · §3 (honest assessment: attention dominates frontier, hybrids take mid-range, pure SSMs niche).

END OF PART IV — What Makes an LLM. The full lifecycle from pretraining through alignment through fine-tuning, plus the alternatives.

Next: Part V opens with Ch.21 — The hardware substrate. GPU memory hierarchy, the roofline model, tensor cores, why H100 vs B200 matters.