Residual Connections: Why Transformers Can Be Deep

TL;DR: A residual connection adds the input of a sub-layer directly to its output: y = x + f(x). This creates a highway for gradients to bypass any sub-layer during backpropagation, enabling very deep networks to train stably. It also encourages each layer to learn small refinements rather than full transformations.

The Problem with Deep Networks

Stacking many layers allows a model to learn increasingly abstract representations. But deep networks face a fundamental training problem: vanishing gradients.

During backpropagation, gradients are computed by repeated multiplication through the chain rule. In a network with L layers, the gradient of the loss with respect to early-layer weights involves multiplying up to L Jacobians together. If the singular values of each Jacobian are below 1 (common with saturating activations such as sigmoid or tanh), the gradient shrinks exponentially with depth, and the early layers learn almost nothing.
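The shrinkage is easy to reproduce in a toy setting. The sketch below is an illustration only: it assumes scalar layers of the form h -> sigmoid(w * h) with every weight set to 1.0, and the helper name `plain_chain_gradient` is made up for this example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def plain_chain_gradient(x, weights):
    """Return d(output)/d(input) for a stack of scalar layers h -> sigmoid(w * h)."""
    grad = 1.0
    h = x
    for w in weights:
        a = sigmoid(w * h)
        grad *= w * a * (1.0 - a)  # local derivative of this layer (at most 0.25 * |w|)
        h = a
    return grad

shallow = plain_chain_gradient(0.5, [1.0] * 3)
deep = plain_chain_gradient(0.5, [1.0] * 30)
print(f"3 layers:  {shallow:.3e}")
print(f"30 layers: {deep:.3e}")
```

Because the sigmoid derivative never exceeds 0.25, each extra layer multiplies the gradient by at most 0.25 in this setup, so the 30-layer value is many orders of magnitude smaller than the 3-layer one.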

This is one reason why naive deep networks (without tricks) can perform worse than shallower ones, a counter-intuitive result that motivated residual connections.

The Residual Fix

Introduced by He et al. (2015) for CNNs (ResNet), residual connections add the input directly to the output of each sub-layer:

y = x + f(x)

Where x is the input, f(x) is whatever the sub-layer computes (attention, FFN, etc.), and y is the output.

This changes what the sub-layer must learn. Instead of learning a full transformation from x to the desired output, it only needs to learn the residual: the difference between the desired output and x. If no change is needed, f(x) = 0 suffices and the layer reduces to the identity function.
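As a minimal sketch of the formula y = x + f(x), the wrapper below adds a skip path around any shape-preserving function on plain Python lists. The names `residual`, `zero_sublayer`, and `small_correction` are illustrative and do not come from any library.

```python
def residual(f):
    """Wrap a sub-layer f so that it computes x + f(x) element-wise."""
    def wrapped(x):
        fx = f(x)
        if len(fx) != len(x):
            raise ValueError("f must preserve the shape of x")
        return [xi + fi for xi, fi in zip(x, fx)]
    return wrapped

def zero_sublayer(x):
    # A sub-layer that has learned "no change is needed".
    return [0.0 for _ in x]

def small_correction(x):
    # A sub-layer that proposes a 10% nudge.
    return [0.1 * xi for xi in x]

x = [1.0, -2.0, 3.0]
identity_out = residual(zero_sublayer)(x)   # equals x
nudged_out = residual(small_correction)(x)  # x plus a small correction
```

With the zero sub-layer the wrapped function is exactly the identity, which is the case described above.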

Why Gradients Flow Better

In a standard deep network with x_i = f_i(x_{i-1}), the gradient of the loss ℒ with respect to an early activation x_l is:

∂ℒ/∂x_l = ∂ℒ/∂x_L · ∏_{i=l+1}^{L} ∂f_i/∂x_{i-1}

This is a product of L − l Jacobians, so its norm can shrink or grow exponentially with depth.

With residual connections, y_l = x_l + f(x_l), so:

∂y_l/∂x_l = I + ∂f/∂x_l

where I is the identity matrix (simply 1 in the scalar case). The gradient always includes this identity term: a direct, unattenuated path from output to input. Even if ∂f/∂x_l ≈ 0 (a saturated or poorly conditioned sub-layer), the gradient still passes back through the identity term unchanged.

Summing over all paths: gradients reach early layers directly via the skip connections. Deep networks become trainable.
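A scalar toy comparison makes the difference concrete. This is a sketch under stated assumptions: each sub-layer is f(h) = tanh(w * h) with a deliberately small weight of 0.1, there are 50 layers, and `chain_gradient` is a hypothetical helper written for this example.

```python
import math

def chain_gradient(x, weights, use_residual):
    """d(output)/d(input) for scalar layers f(h) = tanh(w * h), with or without a skip."""
    grad = 1.0
    h = x
    for w in weights:
        t = math.tanh(w * h)
        local = w * (1.0 - t * t)   # df/dh for this layer
        if use_residual:
            grad *= 1.0 + local     # d(h + f(h))/dh keeps the identity term
            h = h + t
        else:
            grad *= local           # plain stack: only the local derivative
            h = t
    return grad

weights = [0.1] * 50
plain = chain_gradient(1.0, weights, use_residual=False)
skip = chain_gradient(1.0, weights, use_residual=True)
print(f"plain stack:    {plain:.3e}")
print(f"residual stack: {skip:.3e}")
```

In the plain stack every factor is at most 0.1, so the product collapses towards zero. In the residual stack every factor is at least 1, so the gradient cannot vanish in this setup.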

Intuition: Think of the residual connection as a highway. The sub-layer can refine the signal traveling along the highway, but the highway exists regardless. Gradients can always take the highway home, bypassing any congested sub-layer.

Residuals as Incremental Refinements

The residual formulation y = x + f(x) has another interpretation: each layer proposes a small correction to the current representation.

If f is initialised near zero (which happens naturally with small random weights), then at the start of training y ≈ x. The network begins as a near-identity function, which is a useful initialisation because the untrained network does not corrupt the signal.
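The near-identity behaviour at initialisation can be checked numerically. The sketch below assumes a purely linear sub-layer f(x) = W x whose weights are drawn from a Gaussian with standard deviation 1e-3. The dimension, the scale, and the helper names are arbitrary choices for illustration.

```python
import random

def linear_sublayer(weights):
    """Return f(x) = W x for a square weight matrix given as a list of rows."""
    def f(x):
        return [sum(w * xi for w, xi in zip(row, x)) for row in weights]
    return f

def init_weights(dim, scale, rng):
    # Small random weights, so that f(x) starts close to zero.
    return [[rng.gauss(0.0, scale) for _ in range(dim)] for _ in range(dim)]

rng = random.Random(0)
dim = 8
f = linear_sublayer(init_weights(dim, scale=1e-3, rng=rng))

x = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
y = [xi + fi for xi, fi in zip(x, f(x))]  # residual output
max_deviation = max(abs(yi - xi) for xi, yi in zip(x, y))
print(f"largest |y - x| at initialisation: {max_deviation:.2e}")
```

The deviation is tiny compared with the entries of x, so the untrained layer passes its input through almost unchanged.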

As training progresses, each layer learns to add increasingly meaningful corrections. This is why Transformers initialise stably even at 96 layers: no single layer needs to do anything dramatic from the start.

In Transformers: Two Residuals per Block

Each Transformer block contains two sub-layers (attention and FFN), each with its own residual connection:

┌─────────────────────────────────────────────┐
│  x ──────────────────────────────────────+  │
│  │                                       │  │
│  └→ LayerNorm → MultiHeadAttention ──────┘  │
│                                             │
│  x' ─────────────────────────────────────+  │
│  │                                       │  │
│  └→ LayerNorm → FeedForward ─────────────┘  │
└─────────────────────────────────────────────┘
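The wiring in the diagram can be sketched in a few lines. This is not a real Transformer implementation: the block below works on a single vector rather than a sequence, the layer norm has no learned scale or shift, and `toy_attention` and `toy_feed_forward` are placeholder sub-layers invented for the example.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalise a vector to zero mean and unit variance (no learned scale or shift)."""
    mean = sum(x) / len(x)
    var = sum((xi - mean) ** 2 for xi in x) / len(x)
    return [(xi - mean) / math.sqrt(var + eps) for xi in x]

def add(a, b):
    return [ai + bi for ai, bi in zip(a, b)]

def transformer_block(x, attention, feed_forward):
    """Two sub-layers, each normalised on the way in and wrapped in its own residual."""
    x = add(x, attention(layer_norm(x)))      # first residual addition
    x = add(x, feed_forward(layer_norm(x)))   # second residual addition
    return x

# Placeholder sub-layers, standing in for real attention and FFN modules.
def toy_attention(h):
    return [0.0 for _ in h]

def toy_feed_forward(h):
    return [0.5 * hi for hi in h]

x = [1.0, 2.0, 3.0, 4.0]
out = transformer_block(x, toy_attention, toy_feed_forward)
unchanged = transformer_block(x, toy_attention, toy_attention)
```

When both sub-layers output zero, the block returns its input untouched, which is the skip path on its own.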

GPT-3's 96 layers mean 192 residual additions. Chained together, their skip paths form a direct gradient highway from the loss all the way back to the input embeddings.

The Residual Stream View

A useful mental model: think of the Transformer as a residual stream, a single high-dimensional vector that persists across all layers. Each attention head and FFN block reads from this stream and writes back to it via residual addition.

This view, popularised by mechanistic interpretability research (Elhage et al., 2021), makes it clear that:

  • Information is preserved across layers (it stays in the stream)
  • Each layer adds information rather than replacing it
  • Individual layers can be interpreted as reading/writing to a shared memory
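The read/write picture can be made concrete with a toy stream. In this sketch the sub-layers are hypothetical functions that read coordinate 0 of the stream and write to one other coordinate. Real attention heads and FFN blocks are far richer, but the bookkeeping is the same.

```python
def run_stream(embedding, sublayers):
    """Apply sub-layers in order; each reads the stream and adds its output back."""
    stream = list(embedding)
    writes = []
    for sublayer in sublayers:
        update = sublayer(stream)                          # read from the stream
        stream = [s + u for s, u in zip(stream, update)]   # write back by addition
        writes.append(update)
    return stream, writes

def write_to(index, amount):
    """Hypothetical sub-layer: reads coordinate 0 and writes to coordinate `index`."""
    def sublayer(stream):
        update = [0.0] * len(stream)
        update[index] = amount * stream[0]
        return update
    return sublayer

embedding = [1.0, 0.0, 0.0]
final, writes = run_stream(embedding, [write_to(1, 2.0), write_to(2, -3.0)])

# The final stream is the embedding plus the sum of every write.
total = [e + sum(w[i] for w in writes) for i, e in enumerate(embedding)]
```

Nothing is overwritten here: the original embedding is still present in coordinate 0 of `final`, and each sub-layer's contribution can be read off from `writes`.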

What Happens Without Residuals?

Removing residual connections from deep Transformers typically causes:

  • Training instability (loss spikes, divergence)
  • Significantly worse final performance
  • Requirement for much more careful learning rate tuning

Adding them back is cheap (a single addition with no parameters), but the effect is profound.

Summary

| Without residuals | With residuals |
|---|---|
| Gradients vanish in early layers | Gradients flow via identity skip |
| Layers learn full transformations | Layers learn small refinements |
| Deep networks hard to train | Deep networks train stably |
| Initialisation is fragile | Initialisation is near-identity |
| Performance degrades with depth | Performance improves with depth |

Residual connections are among the most important structural elements that allow Transformers to scale to very deep stacks. They cost almost nothing (one addition per sub-layer) but change how trainable a deep model is.