SSMs and Mamba: why people are looking past attention, and an honest assessment of where this stands.
State-space models (SSMs) replace attention with a linear recurrence, x_t = A·x_{t-1} + B·u_t, y_t = C·x_t, that runs in O(N) time in the sequence length N, versus attention's O(N²). The catch: classical SSMs have fixed A, B, C, so every input is written into the state the same way regardless of its content, and the model cannot do content-addressable lookup. Mamba (Gu & Dao 2023) makes B, C, and the discretization step size Δ input-dependent ("selectivity"), which recovers much of attention's content-dependent behavior while staying O(N). For long sequences (N = 128K+), this is a serious efficiency advantage.
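The recurrence above can be sketched in a few lines. This is a minimal illustration, not Mamba's implementation: it assumes a diagonal A, a scalar input per step, and a made-up gating rule (`gated_b`) standing in for the learned input-dependent projections. The names `ssm_scan`, `fixed_b`, `fixed_c`, and `gated_b` are hypothetical.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def ssm_scan(
    u: Sequence[float],
    a: Vector,
    b_fn: Callable[[float], Vector],
    c_fn: Callable[[float], Vector],
) -> List[float]:
    """Run x_t = A*x_{t-1} + B_t*u_t, y_t = C_t . x_t with a diagonal A.

    One pass over the sequence: O(N) steps, each costing O(state size).
    """
    x = [0.0] * len(a)  # hidden state, starts at zero
    ys = []
    for u_t in u:
        b_t = b_fn(u_t)  # constant for a classical SSM, input-dependent if selective
        c_t = c_fn(u_t)
        x = [a_i * x_i + b_i * u_t for a_i, x_i, b_i in zip(a, x, b_t)]
        ys.append(sum(c_i * x_i for c_i, x_i in zip(c_t, x)))
    return ys


def fixed_b(u_t: float) -> Vector:
    """Classical SSM: B ignores the input."""
    return [1.0, 0.5]


def fixed_c(u_t: float) -> Vector:
    """Classical SSM: C ignores the input."""
    return [1.0, 1.0]


def gated_b(u_t: float) -> Vector:
    """Toy selectivity (an assumption for illustration): skip non-positive inputs."""
    return [1.0, 0.5] if u_t > 0 else [0.0, 0.0]


a = [0.5, 0.5]
u = [1.0, -1.0, 2.0]
y_fixed = ssm_scan(u, a, fixed_b, fixed_c)
y_selective = ssm_scan(u, a, gated_b, fixed_c)
print(y_fixed)      # [1.5, -0.75, 2.625]
print(y_selective)  # [1.5, 0.75, 3.375]
```

With fixed B, the negative input at step 2 is always written into the state. With the gated B, the model ignores it and the earlier state simply decays, which is the kind of content-dependent choice a fixed-parameter SSM cannot make.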
Pure SSMs such as Mamba work well but lose to attention on precise long-range recall, because a fixed-size state has to compress the entire history. Pure attention is O(N²) and becomes impractical at very long context. The pattern since 2024 is hybrid architectures: most layers are Mamba, with a few attention layers interspersed. Jamba (AI21), Zamba (Zyphra), Samba (Microsoft), and Granite 4 (IBM) all use variations of this template. RWKV is a parallel, non-SSM lineage of linear-time recurrent models that also competes.
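As a rough sketch of the hybrid template, the layer stack can be described as a schedule with one attention layer every few Mamba layers. The function name `hybrid_layer_schedule` and the 1-in-4 ratio below are assumptions for illustration only; the models named above each choose their own ratio and placement.

```python
from typing import List


def hybrid_layer_schedule(n_layers: int, attention_every: int) -> List[str]:
    """Return layer types with one attention layer per `attention_every` layers."""
    if n_layers <= 0 or attention_every <= 0:
        raise ValueError("n_layers and attention_every must be positive")
    return [
        "attention" if (i + 1) % attention_every == 0 else "mamba"
        for i in range(n_layers)
    ]


# Illustrative ratio only: 3 Mamba layers for every attention layer.
schedule = hybrid_layer_schedule(n_layers=8, attention_every=4)
print(schedule)
```

Here the schedule has six Mamba layers and two attention layers. The design intent is that the Mamba layers carry most of the sequence mixing cheaply, while the occasional attention layer restores exact lookup over the context.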
Pure attention dominates the frontier in 2025. Hybrid architectures (mostly Mamba plus some attention) are gaining ground in the mid-range. Pure SSMs remain interesting research. The honest version: SSMs win on inference cost at very long context; attention wins on quality at the absolute frontier. Over the next 2-3 years, the trajectory likely converges to "transformers with cheaper attention" (sliding window, FlashAttention 3+, SSM-augmented) rather than wholesale replacement of attention. That is the candid assessment.
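One way to see the inference-cost side of this trade-off is to compare what each approach must keep in memory while generating. The sketch below counts stored elements for an attention KV cache, which grows with every token, against a recurrent SSM state, which does not. All configuration numbers are hypothetical, chosen only for illustration, and the SSM count ignores smaller buffers such as a convolution state.

```python
def kv_cache_elements(
    n_tokens: int, n_layers: int, n_kv_heads: int, head_dim: int
) -> int:
    """Elements in an attention KV cache: keys and values for every past token."""
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens


def ssm_state_elements(n_layers: int, d_inner: int, d_state: int) -> int:
    """Elements in a recurrent SSM state: fixed size, independent of context length."""
    return n_layers * d_inner * d_state


# Hypothetical configuration (not taken from any specific model).
N_TOKENS = 128 * 1024
kv = kv_cache_elements(N_TOKENS, n_layers=32, n_kv_heads=8, head_dim=128)
ssm = ssm_state_elements(n_layers=32, d_inner=4096, d_state=16)
print(kv, ssm, kv // ssm)  # 8589934592 2097152 4096
```

Under these assumed sizes the KV cache at 128K tokens holds about 4096 times as many elements as the SSM state, and the gap keeps widening as context grows. The same fixed-size state is also why recall suffers: everything the model remembers has to fit in it.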