Residual Connections: Why Transformers Can Be Deep

Published: March 07, 2024

TL;DR: A residual connection adds the input of a sub-layer directly to its output: y = x + f(x). This creates a highway for gradients to bypass any sub-layer during backpropagation, enabling very deep networks to train stably. It also encourages each layer to learn small refinements rather than full transformations.

The Problem with Deep Networks

Stacking many layers allows a model to learn increasingly abstract representations. But deep networks face a fundamental training problem: vanishing gradients.

During backpropagation, gradients are computed by repeated multiplication through the chain rule. In a network with L layers, the gradient of the loss with respect to early-layer weights involves multiplying L Jacobians together. If each Jacobian has singular values less than 1 (common for standard activations), the gradient shrinks exponentially. Early layers learn almost nothing.
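
To get a feel for how fast this shrinkage happens, here is a minimal sketch that collapses each layer's Jacobian to a single scalar. The value 0.9 and the function name `plain_gradient_scale` are illustrative assumptions, not taken from any real network.

```python
def plain_gradient_scale(num_layers, layer_jacobian=0.9):
    """Scale of the backpropagated gradient after `num_layers` plain layers.

    Toy assumption: every layer's Jacobian is the same scalar, so the
    chain rule reduces to repeated multiplication by that scalar.
    """
    scale = 1.0
    for _ in range(num_layers):
        scale *= layer_jacobian
    return scale


for depth in (10, 50, 100):
    print(depth, plain_gradient_scale(depth))
```

Under this assumption the scale falls to roughly 0.35 at 10 layers, 0.005 at 50, and 0.00003 at 100.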

This is one reason naive deep networks (without such tricks) can perform worse than shallower ones, a counter-intuitive result that motivated residual connections.

The Residual Fix

Introduced by He et al. (2015) for CNNs (ResNet), residual connections add the input directly to the output of each sub-layer:

y = x + f(x)

Where x is the input, f(x) is whatever the sub-layer computes (attention, FFN, etc.), and y is the output.

This changes what the sub-layer must learn. Instead of learning a full transformation from x to the desired output, it only needs to learn the residual: the difference between the desired output and x. If no change is needed, f(x) = 0 gives exactly the identity function.
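
As a minimal sketch of the formula, the wrapper below turns any sub-layer into its residual version. The names `residual` and `zero_sublayer` are hypothetical, and vectors are plain Python lists to keep the example dependency-free.

```python
def residual(sublayer):
    """Wrap `sublayer` so the result computes y = x + f(x) element-wise."""
    def wrapped(x):
        fx = sublayer(x)
        return [xi + fi for xi, fi in zip(x, fx)]
    return wrapped


def zero_sublayer(x):
    """Hypothetical sub-layer that proposes no change: f(x) = 0."""
    return [0.0 for _ in x]


identity_block = residual(zero_sublayer)
x = [1.0, -2.0, 0.5]
y = identity_block(x)  # f(x) = 0, so y equals x
```

When the sub-layer outputs zeros, the wrapped block returns its input unchanged, which is the identity case described above.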

Why Gradients Flow Better

In a standard deep network with x_i = f_i(x_{i-1}), the gradient of the loss ℒ with respect to an early activation x_l is:

∂ℒ/∂x_l = ∂ℒ/∂x_L · ∏_{i=l+1}^{L} ∂f_i/∂x_{i-1}

This is a product of L − l Jacobians, which tends to shrink or grow exponentially with depth.

With residual connections, y_l = x_l + f(x_l), so:

∂y_l/∂x_l = I + ∂f/∂x_l

Here I is the identity matrix (simply 1 in the scalar case). The Jacobian always includes this identity term: a direct, unattenuated path from output to input. Even if ∂f/∂x_l ≈ 0 (a saturated or poorly conditioned sub-layer), the gradient still passes back unchanged through the identity.
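
A small numerical check can make this concrete. The sketch below assumes a scalar toy sub-layer f(x) = w · tanh(x) with a deliberately weak weight w = 0.01, stacks 20 of them, and estimates the input gradient by central finite differences. All names and values are illustrative assumptions.

```python
import math


def forward(x, depth, weight, use_residual):
    """Run a scalar input through `depth` toy sub-layers."""
    for _ in range(depth):
        fx = weight * math.tanh(x)            # toy sub-layer f(x)
        x = x + fx if use_residual else fx    # y = x + f(x)  vs  y = f(x)
    return x


def numerical_gradient(x, eps=1e-6, **kwargs):
    """Central finite-difference estimate of d(output)/d(input)."""
    upper = forward(x + eps, **kwargs)
    lower = forward(x - eps, **kwargs)
    return (upper - lower) / (2 * eps)


plain_grad = numerical_gradient(0.5, depth=20, weight=0.01, use_residual=False)
residual_grad = numerical_gradient(0.5, depth=20, weight=0.01, use_residual=True)
print(plain_grad, residual_grad)
```

In this toy setting the plain stack's gradient is indistinguishable from zero, while the residual stack's gradient stays slightly above 1, because every factor has the form 1 + ∂f/∂x with ∂f/∂x ≥ 0.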

Summing over all paths: gradients reach early layers directly via the skip connections. Deep networks become trainable.
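
The "sum over all paths" can be written out explicitly. As a sketch, assume every layer has the residual form x_i = x_{i-1} + f_i(x_{i-1}), write J_i for the Jacobian ∂f_i/∂x_{i-1}, and write ℒ for the loss to keep it distinct from the depth L. This notation is introduced here for illustration, and the matrix factors are ordered with i = L on the left and i = l+1 on the right.

```latex
\begin{aligned}
\frac{\partial \mathcal{L}}{\partial x_l}
  &= \frac{\partial \mathcal{L}}{\partial x_L}
     \prod_{i=l+1}^{L} \left( I + J_i \right) \\
  &= \frac{\partial \mathcal{L}}{\partial x_L}
     \Bigl( I \;+\; \sum_{i} J_i \;+\; \sum_{i > j} J_i J_j
            \;+\; \dots \;+\; J_L J_{L-1} \cdots J_{l+1} \Bigr)
\end{aligned}
```

The leading I is the pure skip path: it delivers ∂ℒ/∂x_L to layer l unchanged, however small the individual J_i are. Each remaining term corresponds to a path that passes through some subset of the sub-layers, and the final term is the only one a plain network would have.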

Intuition: Think of the residual connection as a highway. The sub-layer can refine the signal traveling along the highway, but the highway exists regardless. Gradients can always take the highway home, bypassing any congested sub-layer.

Residuals as Incremental Refinements

The residual formulation y = x + f(x) has another interpretation: each layer proposes a small correction to the current representation.

If f is initialised near zero (which happens naturally with small random weights), then at the start of training y ≈ x. The network begins as a near-identity function — a useful initialisation since the untrained network does not corrupt the signal.

As training progresses, each layer learns to add increasingly meaningful corrections. This is one reason Transformers start training stably even at 96 layers: no single layer needs to do anything dramatic from the start.
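
The near-identity behaviour at initialisation can be checked with a toy experiment. The sketch below assumes purely linear sub-layers with small Gaussian weights (standard deviation 0.001) on an 8-dimensional vector. These sizes are illustrative and far smaller than any real model; only the depth of 96 is borrowed from the text.

```python
import math
import random

random.seed(0)
DIM, DEPTH, STD = 8, 96, 1e-3  # illustrative sizes, not a real model's


def random_linear(dim, std):
    """A dim x dim weight matrix with small Gaussian entries."""
    return [[random.gauss(0.0, std) for _ in range(dim)] for _ in range(dim)]


def matvec(w, x):
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


def norm(x):
    return math.sqrt(sum(v * v for v in x))


x0 = [random.gauss(0.0, 1.0) for _ in range(DIM)]
layers = [random_linear(DIM, STD) for _ in range(DEPTH)]

x_res, x_plain = x0, x0
for w in layers:
    x_res = [a + b for a, b in zip(x_res, matvec(w, x_res))]  # y = x + f(x)
    x_plain = matvec(w, x_plain)                              # y = f(x)

# Relative distance between the residual stack's output and its input.
drift = norm([a - b for a, b in zip(x_res, x0)]) / norm(x0)
# How much of the input's magnitude survives the plain stack.
plain_ratio = norm(x_plain) / norm(x0)
print(drift, plain_ratio)
```

Under these assumptions the residual stack should leave the vector close to where it started, while the plain stack shrinks it to essentially zero after 96 layers.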

In Transformers: Two Residuals per Block

Each Transformer block contains two sub-layers (attention and FFN), each with its own residual connection:

┌─────────────────────────────────────────────┐
│  x ──────────────────────────────────────+  │
│  │                                       │  │
│  └→ LayerNorm → MultiHeadAttention ──────┘  │
│                                             │
│  x' ─────────────────────────────────────+  │
│  │                                       │  │
│  └→ LayerNorm → FeedForward ─────────────┘  │
└─────────────────────────────────────────────┘

GPT-3’s 96 layers mean 192 residual additions. Because every one of them passes its input through unchanged, the identity path from the loss back to the input is never interrupted.
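
The two-residual block structure can be sketched in a few lines. This is a structural illustration only: `toy_attention` and `toy_ffn` are hypothetical placeholders rather than real attention or feed-forward layers, `layer_norm` omits the learned scale and shift, and the counter exists just to tally the additions.

```python
import math

counts = {"residual_additions": 0}


def residual_add(x, update):
    """The skip connection: element-wise x + f(x), with a tally."""
    counts["residual_additions"] += 1
    return [a + b for a, b in zip(x, update)]


def layer_norm(x, eps=1e-5):
    """Normalise to zero mean and unit variance (no learned parameters)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def toy_attention(h):
    """Placeholder standing in for multi-head attention."""
    return [0.01 * v for v in h]


def toy_ffn(h):
    """Placeholder standing in for the feed-forward network."""
    return [0.01 * max(v, 0.0) for v in h]


def transformer_block(x):
    x = residual_add(x, toy_attention(layer_norm(x)))  # x' = x + Attn(LN(x))
    x = residual_add(x, toy_ffn(layer_norm(x)))        # y  = x' + FFN(LN(x'))
    return x


x = [0.5, -1.0, 2.0, 0.0]
for _ in range(96):
    x = transformer_block(x)
print(counts["residual_additions"])
```

Each call to `transformer_block` performs exactly two residual additions, so a 96-block stack performs 192.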

The Residual Stream View

A useful mental model: think of the Transformer as a residual stream — a single high-dimensional vector that persists across all layers. Each attention head and FFN block reads from this stream and writes back to it via residual addition.

This view, popularised by mechanistic interpretability research (Elhage et al., 2021), makes it clear that:

Information is preserved across layers (it stays in the stream)
Each layer adds information rather than replacing it
Individual layers can be interpreted as reading/writing to a shared memory
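
One consequence of this view is easy to verify: the final stream equals the initial vector plus the sum of everything the sub-layers wrote. The sketch below assumes three hypothetical toy sub-layers and uses the illustrative name `run_residual_stream`. Real attention heads and FFN blocks would take their place.

```python
import math


def run_residual_stream(x0, sublayers):
    """Apply sub-layers residually, logging what each one writes."""
    stream = list(x0)
    writes = []
    for sublayer in sublayers:
        update = sublayer(stream)                          # read from the stream
        writes.append(update)
        stream = [s + u for s, u in zip(stream, update)]   # write back
    return stream, writes


# Hypothetical sub-layers, chosen only to be cheap and non-trivial.
toy_sublayers = [
    lambda s: [0.1 * v for v in s],
    lambda s: [0.05 * math.tanh(v) for v in s],
    lambda s: [-0.02 * v * v for v in s],
]

x0 = [1.0, -0.5, 2.0]
final_stream, writes = run_residual_stream(x0, toy_sublayers)

# Rebuild the final stream from the input plus all logged writes.
reconstructed = [x + sum(w[i] for w in writes) for i, x in enumerate(x0)]
stream_is_sum_of_writes = all(
    math.isclose(a, b, rel_tol=1e-9, abs_tol=1e-12)
    for a, b in zip(final_stream, reconstructed)
)
print(stream_is_sum_of_writes)
```

Because every write is additive, any single sub-layer's contribution can be isolated by inspecting its entry in `writes`, which is what makes the shared-memory reading workable.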

What Happens Without Residuals?

Ablation studies confirm that removing residual connections from deep Transformers causes:

Training instability (loss spikes, divergence)
Significantly worse final performance
Requirement for much more careful learning rate tuning

Adding them back is cheap — it is a single addition with no parameters — but the effect is profound.

Summary

Without residuals                 | With residuals
----------------------------------|---------------------------------
Gradients vanish in early layers  | Gradients flow via identity skip
Layers learn full transformations | Layers learn small refinements
Deep networks hard to train       | Deep networks train stably
Initialisation is fragile         | Initialisation is near-identity
Performance degrades with depth   | Performance improves with depth

Residual connections are arguably the most important structural element that allows Transformers to scale to hundreds of layers. They cost almost nothing (one addition per sub-layer) but change everything.


Alessio Borgi
