Activation functions — ReLU, GELU, SiLU, SwiGLU

Section 10.2

In §10.1 you saw why MLPs need a nonlinearity between linear layers (otherwise stacked layers collapse into a single linear function). What you didn’t see is that the choice of nonlinearity is one of the most consequential architectural decisions in modern deep learning. Between 1990 and 2012, the field used sigmoid and tanh as defaults, and could barely train networks deeper than ~10 layers because of vanishing gradients. The switch to ReLU (Glorot et al. 2011; Krizhevsky et al. 2012’s AlexNet) was the nonlinearity change that made deep CNNs trainable. Since then, the transformer era has refined the choice further: GELU in BERT and GPT-2, then SiLU-gated SwiGLU in PaLM and the Llama family. The arc is short, the math is clean, and each step was driven by an empirical observation about gradient flow.

Sigmoid and tanh — and what went wrong

The classical activations:

σ(x) = 1 / (1 + e⁻ˣ)                  (sigmoid, range (0, 1))
tanh(x) = (eˣ − e⁻ˣ) / (eˣ + e⁻ˣ)     (range (−1, 1))

Both are smooth, bounded, monotonically increasing. They were the natural choice in the 1980s — sigmoid because it looks like a probability output, tanh because it’s zero-centred. Their derivatives:

σ'(x) = σ(x) (1 − σ(x))      peaks at 0.25 when x = 0;  → 0 as |x| → ∞
tanh'(x) = 1 − tanh²(x)      peaks at 1 when x = 0;     → 0 as |x| → ∞
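The two derivative formulas above are easy to check numerically. The following is a minimal sketch using only the standard library; the function names are illustrative, not taken from any library used in this book.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x: float) -> float:
    # sigma'(x) = sigma(x) * (1 - sigma(x))
    s = sigmoid(x)
    return s * (1.0 - s)

def tanh_grad(x: float) -> float:
    # tanh'(x) = 1 - tanh(x)^2
    return 1.0 - math.tanh(x) ** 2

# Derivatives at the peak (x = 0) and progressively further into the tail.
for x in (0.0, 2.0, 5.0, 10.0):
    print(f"x={x:5.1f}  sigmoid'={sigmoid_grad(x):.2e}  tanh'={tanh_grad(x):.2e}")
```

Both derivatives are largest at x = 0 and fall off quickly as |x| grows, which is the saturation behaviour the next paragraph relies on.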

Both have a vanishing-tail problem: for inputs more than ~5 units away from zero, the derivative is below 0.01. During backprop, the gradient reaching a layer that sits L layers below the loss contains a product of L activation derivatives, one per intervening layer. If each derivative is < 1, that product decays exponentially with depth. Past ~10 layers of sigmoid/tanh, the gradient that reaches the early layers is too small to change the weights, and those layers stop learning.

This is the vanishing gradient problem, the structural reason pre-2012 networks rarely went deeper than 8 layers.

Vanishing gradient problem (historical obstacle). When training a deep neural network, the gradients computed by backprop pass through a chain of activation derivatives, one per layer. If each derivative is less than 1 (as with sigmoid/tanh in their tails), the product shrinks exponentially with depth, and the gradient reaching early layers approaches zero. Those layers stop training; the network can't learn deep representations. Resolved (mostly) by ReLU activations + careful initialisation (He init) + skip connections, all from 2012 onward.

Then → now: the problem was first analysed by Hochreiter in 1991 (his diploma thesis, in German). It is famous as the obstacle that LSTM (Hochreiter & Schmidhuber 1997) was designed to overcome for RNNs. For feedforward networks, the resolution came in 2012 with ReLU + AlexNet. The phrase 'vanishing gradient' is still in active use as a diagnostic: it's what's happening when a training loss plateaus and the gradients on early layers are numerically zero.

[Interactive figure: ActivationZoo. Eight activations across NN history, one per tab; the tab shown is ReLU, max(0, x), tagged "breakthrough (2012, Krizhevsky AlexNet)". The dashed line is the derivative.]

Watch the dashed derivative line. Where it sits near zero (the sigmoid/tanh tails), backprop can't propagate signal past that layer. ReLU's derivative is exactly 1 wherever the input is positive, so there is no saturation and no exponentially shrinking gradient chain; that was the unlock that made deep networks trainable. For negative inputs the derivative is 0, which leaves ~50% of units inactive per batch (one form of regularisation) and creates the 'dying ReLU' failure mode, where units that stop activating never recover their gradient. The non-smooth kink at zero is fine for backprop in practice. GELU/SiLU/SwiGLU are the modern smooth-and-gated variants that transformers use.

ReLU — the 2012 unlock

ReLU(x) = max(0, x)                      ← Rectified Linear Unit
ReLU'(x) = { 1 if x > 0, 0 otherwise }

ReLU (Rectified Linear Unit): f(x) = max(0, x). Derivative is 1 for positive inputs, 0 for negative. The activation function that broke the vanishing-gradient barrier and made very deep networks trainable; Krizhevsky et al. 2012 AlexNet was the dramatic public demonstration. Cheap (one max op), no exponentials, no saturation in the positive direction. Three failure modes: dying ReLUs (units that get stuck at zero never recover), non-differentiable at zero (handled by convention), unbounded outputs (can require careful normalisation).

ReLU’s breakthrough: the positive-side gradient is exactly 1, so the activation itself adds no shrinkage to the chain rule, however deep the network goes. This single change made deep CNNs (AlexNet, VGG, GoogLeNet, ResNet) trainable. Before 2012, the standard ImageNet result was ~25% top-5 error; AlexNet (with ReLU + GPU + dropout + data aug) got 15.3%. The error rate has dropped most years since.

Three operational properties worth keeping:

Cheap. One max operation per output. No exp, no log, no division. Hardware loves it.
Sparse. Random pre-activations are positive ~half the time; the others contribute zero to the output. Each forward pass uses ~50% of the units; “free” regularisation.
Dying ReLU. If a unit’s pre-activation becomes consistently negative (e.g., due to a very negative bias after a few steps), its gradient stays at exactly zero forever — it can never recover. Leaky ReLU (small positive slope on the negative side) was the early fix; modern smooth variants are the production answer.
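The dying-ReLU property can be illustrated with a single unit. The sketch below assumes a hypothetical one-input unit z = w·x + b whose bias has drifted very negative; the weight, bias, input values, and the leaky slope of 0.01 are all made up for illustration.

```python
def relu_grad(z: float) -> float:
    return 1.0 if z > 0 else 0.0

def leaky_relu_grad(z: float, slope: float = 0.01) -> float:
    # Leaky ReLU keeps a small positive slope on the negative side.
    return 1.0 if z > 0 else slope

# Hypothetical unit: the bias is so negative that z < 0 for every input.
w, b = 0.5, -10.0
inputs = [-2.0, -0.5, 0.0, 0.7, 1.5, 3.0]
pre_activations = [w * x + b for x in inputs]

# d(activation)/dw = act'(z) * x; add up the magnitudes over the batch.
relu_signal = sum(abs(relu_grad(z) * x) for z, x in zip(pre_activations, inputs))
leaky_signal = sum(abs(leaky_relu_grad(z) * x) for z, x in zip(pre_activations, inputs))

print("ReLU gradient signal:      ", relu_signal)
print("Leaky ReLU gradient signal:", leaky_signal)
```

With plain ReLU the unit receives no gradient from any input in the batch, so gradient descent has nothing to move it with; the leaky variant still passes a small signal.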

— think, then check —

Backprop computes gradients via the chain rule: ∂L/∂(layer 1) = product of activation derivatives across all subsequent layers. If each activation derivative is less than 1 (sigmoid peaks at 0.25; tanh peaks at 1.0 but is < 0.5 for |x| > 1), the product shrinks exponentially with depth.

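For a 20-layer sigmoid network: the activation derivatives alone contribute a factor of at most 0.25²⁰ ≈ 9 × 10⁻¹³ to the gradient at layer 1. float32 can still represent that number, but an update that small is lost to rounding when it is added to a weight of order 1, so it is zero for practical purposes. The early layers’ weights stop updating. The network can’t learn deep representations even though the architecture has the capacity to.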

ReLU’s positive-side derivative is exactly 1, so the gradient passes through every active unit unchanged. No exponential decay from the activation. With sensible initialisation (He init), a deep ReLU network keeps a usable gradient magnitude at layer 1. This property is a large part of why the deep-learning revolution from 2012 onward was possible: AlexNet (Krizhevsky 2012) would have struggled to train 8 layers of sigmoid, let alone the 152 layers of ResNet (He 2015) that followed, which also relied on skip connections and batch normalisation.
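The arithmetic in this answer can be reproduced in a few lines. This is a sketch under stated assumptions: the learning rate and the weight value are hypothetical, and `to_float32` is a small helper that rounds through the standard library's `struct` module rather than a real tensor library.

```python
import struct

def to_float32(x: float) -> float:
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]

depth = 20
sigmoid_bound = 0.25 ** depth   # best case: every sigmoid unit sits at x = 0
relu_factor = 1.0 ** depth      # every ReLU unit on its positive side

lr = 0.1       # hypothetical learning rate
weight = 1.0   # hypothetical weight of order 1
updated = to_float32(weight - lr * sigmoid_bound)

print(f"sigmoid derivative product (upper bound): {sigmoid_bound:.2e}")
print(f"ReLU derivative product (active path):    {relu_factor}")
print(f"float32 weight changed by the update:     {updated != weight}")
```

The sigmoid product is still a representable number, but once it is scaled by a learning rate and subtracted from a weight of order 1, the float32 result is the original weight. The ReLU product stays at 1 along any path of active units.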

↳ §10.2 vanishing gradients

The smooth-and-gated family — GELU, SiLU, SwiGLU

Around 2016–2017 the field started asking: can we get ReLU’s gradient flow without the kink? The kink at zero is mildly inconvenient (no second derivative, possible “dying” units), and a smooth replacement might generalise slightly better.

GELU (Gaussian Error Linear Unit), introduced by Hendrycks & Gimpel 2016, is exactly this:

GELU(x) = x · Φ(x)            where Φ is the standard normal CDF
equivalently:
GELU(x) = x · P(Z ≤ x)        where Z ~ N(0, 1)

In code, a tanh approximation is used for speed:
GELU(x) ≈ 0.5 x (1 + tanh(√(2/π) (x + 0.044715 x³)))
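To see how close the tanh form is to the exact definition, the two can be compared on a grid. This is a minimal sketch: the exact version writes Φ through `math.erf`, and the grid range of [−6, 6] is an arbitrary choice for illustration.

```python
import math

def gelu_exact(x: float) -> float:
    # Phi(x) = 0.5 * (1 + erf(x / sqrt(2))) is the standard normal CDF.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x: float) -> float:
    # The tanh approximation quoted above.
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))

grid = [i / 100.0 for i in range(-600, 601)]
max_gap = max(abs(gelu_exact(x) - gelu_tanh(x)) for x in grid)
print(f"largest |exact - tanh approx| on the grid: {max_gap:.1e}")
```

On this grid the two curves agree to well under 10⁻³ in absolute terms, which is why the approximation is usually treated as interchangeable with the exact form.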

The intuition: “multiply x by the probability that a standard normal draw falls below x.” Becomes x for large positive x (probability → 1), becomes 0 for large negative x (probability → 0), with a smooth interpolation between. GELU was the default in BERT, GPT-2 and RoBERTa, the first generation of transformers.

SiLU (also called Swish), popularised by Ramachandran, Zoph & Le 2017, who found it via an automated search over activation functions:

SiLU(x) = x · σ(x)

Looks similar to GELU but uses the simpler sigmoid in place of the Gaussian CDF as the soft gate. Faster to compute (no Gaussian CDF). SiLU is the nonlinearity inside the gated FFNs of Llama, Mistral, Qwen.
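A short sketch of SiLU and its derivative, written out by hand from the product rule: d/dx [x·σ(x)] = σ(x) + x·σ(x)(1 − σ(x)). The function names are illustrative only.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def silu(x: float) -> float:
    return x * sigmoid(x)

def silu_grad(x: float) -> float:
    # Product rule: sigma(x) + x * sigma(x) * (1 - sigma(x))
    s = sigmoid(x)
    return s * (1.0 + x * (1.0 - s))

def relu_grad(x: float) -> float:
    return 1.0 if x > 0 else 0.0

for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(f"x={x:5.1f}  SiLU={silu(x):8.4f}  SiLU'={silu_grad(x):8.4f}  ReLU'={relu_grad(x):.0f}")
```

Unlike ReLU, the derivative is not pinned to zero for negative inputs, so a unit whose pre-activation drifts moderately negative still receives a gradient.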

SwiGLU (Swish + Gated Linear Unit), introduced by Shazeer 2020, takes the next step:

SwiGLU(x; W, V) = SiLU(W x) ⊙ (V x)

Two parallel linear projections: one passes through SiLU, the other passes through unchanged; they are combined by elementwise multiply. Operationally, the FFN sublayer becomes:

y = W₂ · ( SiLU(W₁_a · x) ⊙ (W₁_b · x) ) + b₂

This is one more matrix multiply per FFN sublayer than vanilla SiLU, but at a matched parameter budget (a narrower inner dim) it consistently improves the model's per-FLOP loss.
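The FFN formula above can be sketched for a single token vector in plain Python. This is a toy illustration, not a production layer: the sizes (d = 6, inner dim 16), the random weights, and the helper names `matvec` and `random_matrix` are all assumptions made for the example.

```python
import math
import random

def silu(z: float) -> float:
    return z / (1.0 + math.exp(-z))

def matvec(W, x):
    # W is a list of rows; returns W @ x.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def swiglu_ffn(x, W1a, W1b, W2, b2):
    gate = [silu(z) for z in matvec(W1a, x)]       # SiLU branch
    value = matvec(W1b, x)                         # linear branch
    hidden = [g * v for g, v in zip(gate, value)]  # elementwise product
    return [y + b for y, b in zip(matvec(W2, hidden), b2)]

def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
d, inner = 6, 16                     # toy sizes; 16 = 8 * 6 / 3
W1a = random_matrix(inner, d, rng)
W1b = random_matrix(inner, d, rng)
W2 = random_matrix(d, inner, rng)
b2 = [0.0] * d

x = [rng.gauss(0.0, 1.0) for _ in range(d)]
y = swiglu_ffn(x, W1a, W1b, W2, b2)
print(len(y), "outputs for", len(x), "inputs")
```

The two input projections share the same shape, and the gating costs only one elementwise multiply on top of the extra matrix product.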

The gating idea: one branch controls which features contribute, the other supplies the values. It’s a kind of multiplicative, attention-like gating applied per-token within the FFN. Used in PaLM, Llama (from the first release onward), Mistral 7B, and every Llama derivative; the default for transformer FFN activations as of late 2025.

— think, then check —

Addresses vanishing gradients: in the chain rule, ∂L/∂(layer 1) contains a product of layer-wise activation derivatives. If each derivative is 1 (ReLU on positive inputs), that product stays at 1 regardless of depth, so the activations no longer shrink the gradient. A 152-layer ResNet trains fine because each ReLU’s positive-side derivative passes the gradient through unchanged, with skip connections and normalisation handling the rest.

Introduces dying ReLU: for inputs where the ReLU is in its zero region, the local derivative is exactly 0. If a unit’s pre-activation becomes consistently negative (e.g. due to weight updates or a very negative bias), all gradients to that unit are zero, and it stops learning permanently. A ‘dead’ unit contributes nothing.

Why dying ReLU is less serious:

It’s localised: it affects individual units, not the network’s depth. A 30% dead-unit fraction at layer 100 doesn’t prevent layer 1 from training (which vanishing gradients would), because the gradient still flows through the remaining units.
It’s fixable: leaky ReLU, GELU, SiLU all pass a non-zero gradient for moderately negative inputs, so a unit that drifts negative can still recover.
It’s self-limiting — a network with too many dead units will have higher training loss, so during training the optimizer naturally pushes activations to avoid landing in the dead region.

Vanishing gradients were structural — couldn’t be fixed without a fundamentally different activation. Dying ReLU is a tuning issue. The trade was the right one to make in 2012, and the modern smooth variants (GELU, SiLU, SwiGLU) fix dying-ReLU directly while preserving the gradient-flow benefit.

↳ §10.2 ReLU + Ch.4 §3 chain rule

Counting the FFN’s parameter delta — vanilla vs SwiGLU

A standard transformer FFN sublayer is d → 4d → d with a single ReLU/GELU/SiLU. Switching to SwiGLU adds one more d → 4d projection — but to keep the parameter budget the same, the SwiGLU FFN typically uses inner dim 8d/3 instead of 4d:

Vanilla SiLU FFN (inner dim 4d):
                        Parameters
  W₁   : d → 4d         4 d²
  W₂   : 4d → d         4 d²
  total                 8 d²   = 8 · 4096² ≈ 134M (for d = 4096)

SwiGLU FFN (Llama family, inner dim 8d/3):
                        Parameters
  W₁_a : d → 8d/3       8/3 d²
  W₁_b : d → 8d/3       8/3 d²
  W₂   : 8d/3 → d       8/3 d²
  total                 8 d²   (same!) but with the extra gate.

So SwiGLU “buys” the gating with a narrower inner dim, maintaining the same parameter count. Empirically the SwiGLU variant gets ~1–2% lower training loss at the same compute, consistent across model scale. Shazeer 2020 (“GLU Variants Improve Transformer,” arXiv:2002.05202) compared many GLU variants systematically; the gated variants beat the ungated baselines, with GEGLU and SwiGLU at the top, and SwiGLU has been the default since.

— think, then check —

Vanilla SiLU FFN uses one input projection W₁ (d × 4d) and one output projection W₂ (4d × d). Total parameters: d × 4d + 4d × d = 8d².

SwiGLU FFN uses TWO input projections (one for the SiLU branch, one for the gate) and one output. With inner dim D: W₁_a (d × D), W₁_b (d × D), W₂ (D × d). Total = 3·dD.

To match the vanilla parameter count: 3dD = 8d², so D = 8d/3. With d = 4096, D ≈ 10923, which is typically rounded up to a hardware-friendly multiple (Llama-2 uses 11008 = 43 × 256).
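The parameter arithmetic can be checked directly. In this sketch the rounding granularity of 256 is an assumption chosen because it reproduces the 11008 figure quoted above; the function names are hypothetical.

```python
def vanilla_ffn_params(d: int, mult: int = 4) -> int:
    # W1 (d x mult*d) plus W2 (mult*d x d), biases ignored.
    return 2 * d * (mult * d)

def swiglu_ffn_params(d: int, inner: int) -> int:
    # W1_a and W1_b (d x inner each) plus W2 (inner x d), biases ignored.
    return 3 * d * inner

def rounded_inner_dim(d: int, multiple_of: int = 256) -> int:
    # Start from 8d/3 and round up to the next multiple (assumed granularity).
    target = (8 * d) // 3
    return multiple_of * ((target + multiple_of - 1) // multiple_of)

d = 4096
inner = rounded_inner_dim(d)
vanilla = vanilla_ffn_params(d)
gated = swiglu_ffn_params(d, inner)

print(f"inner dim:      {inner}")
print(f"vanilla params: {vanilla:,}")
print(f"SwiGLU params:  {gated:,}")
print(f"ratio:          {gated / vanilla:.4f}")
```

Because the inner dim is rounded up rather than truncated, the SwiGLU count for d = 4096 lands a little under 1% above 8d² instead of matching it exactly.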

The interpretation: SwiGLU ‘spends’ the parameter budget that would have gone to a wider single projection on having two narrower ones plus the elementwise gating multiply. The extra structural flexibility — gated control over which features contribute — empirically outperforms the wider non-gated version at the same compute and parameter count.

This is the modern recipe: any time a paper proposes ‘replace activation X with activation Y’ where Y has more parameters per FFN, the comparison should hold the total parameter count constant by narrowing Y’s inner dim. Otherwise you’re just measuring ‘more parameters = better,’ which isn’t an architecture claim.

↳ §10.2 SwiGLU

END OF CH.10 §2 — Activation functions.
Built: ActivationZoo viz (tabs through seven activations from sigmoid to SwiGLU, with derivative plotted as dashed line). Three recall items: easy (vanishing gradients), medium (ReLU’s gradient-flow fix vs dying-ReLU tradeoff), hard (the SwiGLU inner-dim 8d/3 arithmetic from Llama-2).
Coming next: §10.3 — Train a 2-layer MLP end-to-end with AdamW on the two-moons dataset. Reuses Ch.9’s autograd library, swaps SGD for AdamW from Ch.8 §3.