Activation functions — ReLU, GELU, SiLU, SwiGLU
In §10.1 you saw why MLPs need a nonlinearity between linear layers (otherwise stacked layers collapse into a single linear function). What you didn’t see is that the choice of nonlinearity is one of the most consequential architectural decisions in modern deep learning. Between 1990 and 2012, the field used sigmoid and tanh as defaults and struggled to train networks deeper than ~10 layers because of vanishing gradients. The switch to ReLU (Glorot et al. 2011; Krizhevsky et al. 2012’s AlexNet) was the single nonlinearity change that made deep CNNs trainable. Since then, the transformer era has refined the choice further: GELU in BERT/GPT-2, then SwiGLU, a gated FFN built on SiLU, in PaLM and the Llama family. The arc is short, the math is clean, and each step was driven by an empirical observation about gradient flow.
Sigmoid and tanh — and what went wrong
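The classical activations are the logistic sigmoid, σ(x) = 1 / (1 + e⁻ˣ), which maps the real line to (0, 1), and the hyperbolic tangent, tanh(x) = (eˣ − e⁻ˣ) / (eˣ + e⁻ˣ), which maps it to (−1, 1).
Both are smooth, bounded, monotonically increasing. They were the natural choice in the 1980s: sigmoid because its output looks like a probability, tanh because it’s zero-centred. Their derivatives are σ′(x) = σ(x)(1 − σ(x)), which peaks at 0.25 at x = 0, and tanh′(x) = 1 − tanh²(x), which peaks at 1 at x = 0.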
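To make the derivative behaviour concrete, here is a minimal sketch using only Python's standard library. The function names are illustrative, not taken from any library discussed in this chapter.

```python
import math

def sigmoid(x: float) -> float:
    """Logistic sigmoid 1 / (1 + e^-x). Fine for moderate x; overflows for x < about -709."""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x: float) -> float:
    """Derivative of sigmoid: s(x) * (1 - s(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

def tanh_grad(x: float) -> float:
    """Derivative of tanh: 1 - tanh(x)^2."""
    t = math.tanh(x)
    return 1.0 - t * t

# Both derivatives are largest at x = 0 and shrink quickly in the tails.
for x in (0.0, 1.0, 2.0, 5.0):
    print(f"x={x:>4}: sigmoid'={sigmoid_grad(x):.5f}  tanh'={tanh_grad(x):.5f}")
```

At x = 0 this prints the two peak values (0.25 and 1.0); by x = 5 both derivatives are below 0.01.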
Both have a vanishing-tail problem: for inputs more than ~5 units away from zero, the derivative is essentially zero (σ′(5) ≈ 0.007, tanh′(5) ≈ 0.0002). During backprop, the gradient reaching an early layer includes a product of activation derivatives, one per layer between it and the loss. If each factor is < 1, the gradient decays exponentially with depth. Past ~10 layers of sigmoid/tanh, the gradient that reaches the early layers is too small to change the weights, and those layers effectively stop learning.
This is the vanishing gradient problem, the structural reason pre-2012 networks rarely went deeper than ~10 layers.
Click through the tabs. Watch the dashed derivative line: sigmoid and tanh both saturate to zero in their tails. ReLU’s derivative is just 1 where the input was positive — no vanishing, no exponentially-shrinking gradient chain.
ReLU — the 2012 unlock
ReLU is defined as ReLU(x) = max(0, x). Its breakthrough property: the positive-side gradient is exactly 1, so active units add no shrinkage to the chain rule, no matter how deep the network goes. This single change helped make deep CNNs (AlexNet, VGG, GoogLeNet, ResNet) trainable. Before 2012, the best ImageNet results sat around 26% top-5 error; AlexNet (with ReLU + GPU + dropout + data aug) got 15.3%. The error rate has dropped most years since.
Three operational properties worth keeping:
- Cheap. One max operation per output. No exp, no log, no division. Hardware loves it.
- Sparse. Random pre-activations are positive ~half the time; the others contribute zero to the output. Each forward pass uses ~50% of the units; “free” regularisation.
- Dying ReLU. If a unit’s pre-activation becomes consistently negative (e.g., due to a very negative bias after a few steps), its gradient stays at exactly zero forever — it can never recover. Leaky ReLU (small positive slope on the negative side) was the early fix; modern smooth variants are the production answer.
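The dying-ReLU failure mode can be reproduced with a single hypothetical unit. The sketch below assumes a scalar unit z = w·x + b with made-up values for w, b, and the inputs; the derivative at exactly zero is taken to be 0 by convention.

```python
def relu(x: float) -> float:
    return max(0.0, x)

def relu_grad(x: float) -> float:
    # Convention: treat the derivative at exactly 0 as 0.
    return 1.0 if x > 0.0 else 0.0

def leaky_relu(x: float, slope: float = 0.01) -> float:
    return x if x > 0.0 else slope * x

def leaky_relu_grad(x: float, slope: float = 0.01) -> float:
    return 1.0 if x > 0.0 else slope

# Hypothetical unit with a very negative bias: z = w * x + b is negative
# for every input in this toy batch.
w, b = 0.5, -10.0
inputs = [-2.0, -0.5, 0.0, 1.0, 3.0]
pre_acts = [w * x + b for x in inputs]

# d(output)/dw = act'(z) * x for each example.
relu_w_grads = [relu_grad(z) * x for z, x in zip(pre_acts, inputs)]
leaky_w_grads = [leaky_relu_grad(z) * x for z, x in zip(pre_acts, inputs)]

print(relu_w_grads)   # every entry is zero: nothing can move w or b back
print(leaky_w_grads)  # small but non-zero wherever x != 0
```

With ReLU, every example contributes a zero gradient, so no update can pull the unit back into its active region. The leaky variant keeps a small signal alive.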
Backprop computes gradients via the chain rule: ∂L/∂(layer 1) contains a product of activation derivatives across all subsequent layers (interleaved with the weight matrices, which this argument sets aside). If each activation derivative is less than 1 (sigmoid peaks at 0.25; tanh peaks at 1.0 but is < 0.5 for |x| > 1), the product shrinks exponentially with depth.
For a 20-layer sigmoid network, the activation-derivative factor at layer 1 is at most 0.25²⁰ ≈ 9 × 10⁻¹³. That number is still representable in float32, but an update of that size is lost to rounding when added to a weight of ordinary magnitude, so in practice the early layers’ weights stop updating. The network can’t learn deep representations even though the architecture has the capacity to.
ReLU’s positive-side derivative is exactly 1, so along active units the gradient passes through unchanged, with no exponential decay from the activation. With a suitable weight initialisation, a 100-layer ReLU network can have roughly the same gradient magnitude at layer 1 as at layer 99. This property is a large part of why the deep-learning revolution from 2012 onward was possible: AlexNet (Krizhevsky 2012) would have been far harder to train with 8 layers of sigmoid, let alone the 152 layers of ResNet (He 2015), which also relies on skip connections and batch normalisation.
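The depth arithmetic can be checked directly. This sketch multiplies only the activation derivatives along one path and deliberately ignores the weight matrices, so it is a bound on the activation's contribution rather than a full gradient computation. The helper names are illustrative.

```python
import math

def sigmoid_grad(x: float) -> float:
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)

def chain_factor(local_grads) -> float:
    """Multiply the per-layer activation derivatives along one path."""
    product = 1.0
    for g in local_grads:
        product *= g
    return product

def best_case_sigmoid(depth: int) -> float:
    """Largest possible factor: every unit sits at x = 0, where sigmoid' = 0.25."""
    return chain_factor([sigmoid_grad(0.0)] * depth)

def relu_active_path(depth: int) -> float:
    """ReLU contributes exactly 1.0 per layer while the unit is active."""
    return chain_factor([1.0] * depth)

for depth in (5, 10, 20):
    print(depth, best_case_sigmoid(depth), relu_active_path(depth))
```

Even in the best case for sigmoid, the factor at depth 20 is about 9 × 10⁻¹³, while the active-path ReLU factor stays at 1.0 at any depth.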
The smooth-and-gated family — GELU, SiLU, SwiGLU
Around 2016–2017 the field started asking: can we get ReLU’s gradient flow without the kink? The kink at zero is mildly inconvenient (the derivative is undefined there, and units can “die”), and a smooth replacement might generalise slightly better.
GELU (Gaussian Error Linear Unit), introduced by Hendrycks & Gimpel 2016, is exactly this: GELU(x) = x · Φ(x), where Φ is the cumulative distribution function of the standard normal distribution.
The intuition: “multiply x by the probability that a unit-Gaussian draw falls below x.” It becomes x for large positive x (probability → 1), becomes 0 for large negative x (probability → 0), with a smooth interpolation in between. GELU was the default in BERT and GPT-2, the first generation of pretrained transformer language models.
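A minimal sketch of GELU using the standard library's error function, alongside the tanh-based approximation given in the GELU paper. Function names are illustrative.

```python
import math

def normal_cdf(x: float) -> float:
    """Standard normal CDF: P(Z <= x) for Z ~ N(0, 1)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu(x: float) -> float:
    """Exact GELU: x times the standard normal CDF at x."""
    return x * normal_cdf(x)

def gelu_tanh(x: float) -> float:
    """Tanh approximation to GELU (avoids evaluating erf)."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))

for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(f"x={x:>5}: exact={gelu(x):.6f}  tanh approx={gelu_tanh(x):.6f}")
```

The two versions agree closely over the usual range of pre-activations. The approximation exists mainly because erf was historically slower or unavailable on some hardware.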
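SiLU (Sigmoid Linear Unit, also called Swish) was popularised by Ramachandran, Zoph & Le 2017, who found it via an automated search over activation functions; the same function appears earlier under the name SiLU in Hendrycks & Gimpel 2016. It is defined as SiLU(x) = x · σ(x).
It looks similar to GELU but uses the simpler sigmoid as the soft gate in place of the Gaussian CDF, so it is cheaper to compute. SiLU is the activation inside the gated FFNs of Llama, Mistral, and Qwen.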
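For comparison, a sketch of SiLU next to the exact GELU, again with the standard library only. The sigmoid is written in a numerically stable two-branch form, which is a common implementation choice and not something the text prescribes.

```python
import math

def sigmoid(x: float) -> float:
    """Numerically stable sigmoid: never exponentiates a large positive number."""
    if x >= 0.0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def silu(x: float) -> float:
    """SiLU / Swish: x times sigmoid(x)."""
    return x * sigmoid(x)

def gelu(x: float) -> float:
    """Exact GELU: x times the standard normal CDF at x."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(f"x={x:>5}: silu={silu(x):.6f}  gelu={gelu(x):.6f}")
```

Both functions pass through zero, approach x for large positive inputs, and dip slightly below zero for moderate negative inputs. SiLU's transition is a little wider.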
SwiGLU (Swish + Gated Linear Unit), introduced by Shazeer 2020, takes the next step: SwiGLU(x) = (SiLU(x W₁_a) ⊙ (x W₁_b)) W₂, where ⊙ is elementwise multiplication and W₁_a, W₁_b are two separate input projections.
The gating idea: one branch controls which features contribute, the other supplies the values. Loosely, it’s a kind of multiplicative attention over features, applied per-token within the FFN. Used in PaLM, the Llama family, Mistral 7B, and most Llama derivatives; the default for transformer FFN activations as of late 2025.
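A toy forward pass shows the data flow of a SwiGLU FFN. This is a pure-Python sketch for a single token, with no biases, and with hypothetical weight names W1_a (SiLU branch), W1_b (linear branch), and W2 (output projection). Real implementations use batched tensor libraries.

```python
import math

def silu(x: float) -> float:
    # Fine for moderate inputs; a stable sigmoid is preferable in production.
    return x / (1.0 + math.exp(-x))

def matvec(x, W):
    """Row vector x (length n) times matrix W (n rows, m columns)."""
    cols = len(W[0])
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(cols)]

def swiglu_ffn(x, W1_a, W1_b, W2):
    activated = [silu(v) for v in matvec(x, W1_a)]       # SiLU branch
    linear = matvec(x, W1_b)                             # linear branch
    hidden = [a * g for a, g in zip(activated, linear)]  # elementwise gating
    return matvec(hidden, W2)                            # project back to d

# Toy shapes: model dim d = 2, inner dim D = 3.
x = [1.0, 2.0]
W1_a = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
W1_b = [[1.0, 1.0, 1.0], [0.0, 0.0, 0.0]]
W2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

y = swiglu_ffn(x, W1_a, W1_b, W2)
print(y)
```

Setting W1_b to all zeros forces the output to zero regardless of the SiLU branch, which is the sense in which one branch controls what the other contributes.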
Addresses vanishing gradients: in the chain rule, ∂L/∂(layer 1) contains a product of layer-wise activation derivatives. If each derivative is 1 (ReLU on positive inputs), that product stays at 1 regardless of depth, so the activation itself never shrinks the gradient. This is one reason a 152-layer ResNet trains well (its skip connections are the other): each ReLU’s positive-side derivative passes the gradient through unchanged.
Introduces dying ReLU: for inputs where the ReLU is in its zero region, the local derivative is exactly 0. If a unit’s pre-activation becomes consistently negative (e.g. due to weight updates or a very negative bias), all gradients to that unit are zero, and it stops learning permanently. A ‘dead’ unit contributes nothing.
Why dying ReLU is less serious:
- It’s localised — affects individual units, not the network’s depth. A 30% dead-unit fraction at layer 1 doesn’t prevent layer 100 from training (which vanishing gradients would).
- It’s fixable: leaky ReLU has a non-zero derivative everywhere, and GELU and SiLU have a non-zero derivative for almost every negative input, so a unit keeps receiving some gradient (tiny in the far-negative tail) instead of being cut off exactly.
- It’s largely self-limiting: a network with too many dead units will have higher training loss, and the units that are still alive keep adapting to compensate, so in practice the dead fraction usually stays tolerable.
Vanishing gradients were structural: with sigmoid/tanh there was no simple tuning fix, and going deeper required a different activation. Dying ReLU is a tuning issue. The trade was the right one to make in 2012, and the modern smooth variants (GELU, SiLU, SwiGLU) largely avoid dying units while preserving the gradient-flow benefit.
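The difference on the negative side can be checked numerically. This sketch uses the closed-form derivatives SiLU′(x) = σ(x)(1 + x(1 − σ(x))) and GELU′(x) = Φ(x) + x·φ(x), where φ is the standard normal density. Note that each smooth derivative does cross zero at one isolated negative input, so "non-zero" here means non-zero almost everywhere.

```python
import math

def relu_grad(x: float) -> float:
    return 1.0 if x > 0.0 else 0.0

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def silu_grad(x: float) -> float:
    """d/dx [x * sigmoid(x)] = s * (1 + x * (1 - s))."""
    s = sigmoid(x)
    return s * (1.0 + x * (1.0 - s))

def gelu_grad(x: float) -> float:
    """d/dx [x * Phi(x)] = Phi(x) + x * phi(x)."""
    cdf = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
    return cdf + x * pdf

for x in (-0.5, -1.0, -3.0, -6.0):
    print(f"x={x:>5}: relu'={relu_grad(x)}  "
          f"silu'={silu_grad(x):+.6f}  gelu'={gelu_grad(x):+.2e}")
```

ReLU's derivative is exactly zero at every negative input. The smooth variants still pass a signal, although GELU's becomes very small by x = −6.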
Counting the FFN’s parameter delta — vanilla vs SwiGLU
A standard transformer FFN sublayer is d → 4d → d with a single ReLU/GELU/SiLU. Switching to SwiGLU adds one more d → 4d projection, so to keep the parameter budget the same, the SwiGLU FFN typically uses inner dim 8d/3 instead of 4d: the vanilla FFN has 2 · d · 4d = 8d² parameters, and the SwiGLU FFN has 3 · d · (8d/3) = 8d².
So SwiGLU “buys” the gating with a narrower inner dim, maintaining the same parameter count. Empirically the SwiGLU variant gets roughly 1–2% lower training loss at the same compute. Shazeer 2020 (“GLU Variants Improve Transformer,” arXiv:2002.05202) compared many GLU variants systematically; the GELU- and Swish-gated versions (GEGLU, SwiGLU) came out ahead, and SwiGLU has been the common default since.
Vanilla SiLU FFN uses one input projection W₁ (d × 4d) and one output projection W₂ (4d × d). Total parameters: d × 4d + 4d × d = 8d².
SwiGLU FFN uses TWO input projections (one for the SiLU branch, one for the gate) and one output. With inner dim D: W₁_a (d × D), W₁_b (d × D), W₂ (D × d). Total = 3·dD.
To match the vanilla parameter count: 3dD = 8d², so D = 8d/3. With d = 4096, D ≈ 10923, which is typically rounded up to a hardware-friendly multiple (Llama-2 7B uses 11008 = 43 × 256).
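The arithmetic is easy to verify in a few lines. The rounding rule below (round up to a multiple of 256) is an assumption chosen because it reproduces the 11008 figure; other multiples give different inner dims. Function names are illustrative, and biases are ignored.

```python
def vanilla_ffn_params(d: int, mult: int = 4) -> int:
    """d -> mult*d -> d with two weight matrices."""
    return 2 * d * (mult * d)

def swiglu_ffn_params(d: int, inner: int) -> int:
    """Two d -> inner input projections plus one inner -> d output projection."""
    return 3 * d * inner

def round_up(value: int, multiple_of: int) -> int:
    """Smallest multiple of `multiple_of` that is >= value."""
    return multiple_of * ((value + multiple_of - 1) // multiple_of)

d = 4096
exact_inner = 8 * d / 3                    # about 10922.67
inner = round_up(int(exact_inner), 256)    # assumed rounding rule

print(exact_inner, inner)
print(vanilla_ffn_params(d), swiglu_ffn_params(d, inner))
```

After rounding up, the SwiGLU FFN ends up slightly larger than the vanilla one (under 1% in this case), so the parameter match is approximate.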
The interpretation: SwiGLU ‘spends’ the parameter budget that would have gone to a wider single projection on having two narrower ones plus the elementwise gating multiply. The extra structural flexibility — gated control over which features contribute — empirically outperforms the wider non-gated version at the same compute and parameter count.
This is the modern recipe: any time a paper proposes ‘replace activation X with activation Y’ where Y has more parameters per FFN, the comparison should hold the total parameter count constant by narrowing Y’s inner dim. Otherwise you’re just measuring ‘more parameters = better,’ which isn’t an architecture claim.
END OF CH.10 §2 — Activation functions.
Built: ActivationZoo viz (tabs through seven activations from sigmoid to SwiGLU, with derivative plotted as dashed line). Three recall items: easy (vanishing gradients), medium (ReLU’s gradient-flow fix vs dying-ReLU tradeoff), hard (the SwiGLU inner-dim 8d/3 arithmetic from Llama-2).
Coming next: §10.3 — Train a 2-layer MLP end-to-end with AdamW on the two-moons dataset. Reuses Ch.9’s autograd library, swaps SGD for AdamW from Ch.8 §3.