Layer Normalization in Transformers

4 minute read

Published: March 06, 2024

TL;DR: Layer norm rescales each token's feature vector to have zero mean and unit variance, then applies learned scale (γ) and shift (β). Post-LN (original Transformer) is less stable; Pre-LN (used by GPT-2, LLaMA) allows training without warmup and scales more reliably.

Why Normalisation at All?

Deep networks suffer from internal covariate shift: as weights update during training, the distribution of activations at each layer changes unpredictably. Later layers must constantly adapt to a moving target.

Normalisation layers stabilise these distributions. For Transformers, layer normalisation is the standard solution.

Layer Norm vs Batch Norm

Batch Normalisation normalises across the batch dimension: for each feature, compute mean and variance across all examples in the batch.

Problematic for variable-length sequences (different batch elements have different lengths)
Requires large batch sizes to estimate statistics reliably
Behaviour differs between training and inference (running mean/var at test time)

Layer Normalisation normalises across the feature dimension: for each token, compute mean and variance across all d_model features.

Independent of batch size and sequence length
Identical behaviour at training and inference time
Natural fit for sequence models

The Layer Norm Formula

Given a token representation x ∈ ℝ^d:

μ = (1/d) Σᵢ xᵢ σ² = (1/d) Σᵢ (xᵢ − μ)²

LayerNorm(x) = γ · (x − μ) / √(σ² + ε) + β

μ, σ² are computed per-token (across features)
γ (scale) and β (shift) are learned parameters, initialised to 1 and 0 respectively
ε (typically 1e-5) prevents division by zero

After normalisation, the output has approximately zero mean and unit variance. γ and β then allow the network to re-scale and re-shift to whatever distribution is optimal — without collapsing the normalisation.

Post-LN: The Original Placement

The 2017 Transformer paper placed layer norm after the residual addition:

x → [Attention] → x + attn(x) → LayerNorm → next layer

In full notation for one sub-layer:

y = LayerNorm(x + Sublayer(x))

This is called Post-LN (normalisation after the residual). It was the standard until roughly 2019.

Problem: In Post-LN, gradients must flow through the LayerNorm on the path back through the residual stream. At initialisation, this can produce very large or unstable gradients in early layers of deep networks. Post-LN models require careful learning rate warmup and are sensitive to hyperparameters.

Pre-LN: The Modern Standard

Pre-LN places layer norm before each sub-layer, inside the residual branch:

x → LayerNorm → [Attention] → x + attn(LayerNorm(x)) → next layer

In full notation:

y = x + Sublayer(LayerNorm(x))

The residual path remains a clean identity: y = x + f(x). Gradients can bypass the sub-layer entirely by flowing through the residual skip connection. This dramatically stabilises training.

Practical consequence: Pre-LN models can be trained without learning rate warmup, at higher learning rates, and to greater depth. GPT-2, GPT-3, LLaMA, PaLM, and most modern LLMs use Pre-LN. Post-LN is still used in BERT and some encoder models with careful tuning.

RMSNorm: A Simpler Variant

Many recent models (LLaMA, Mistral, Gemma) use RMSNorm instead of full layer norm:

RMSNorm(x) = γ · x / RMS(x) where RMS(x) = √( (1/d) Σᵢ xᵢ² )

RMSNorm removes the mean-centering step (no μ subtraction). This is:

Faster: ~15-20% less computation
Equally effective empirically
Motivated by the observation that re-centring contributes little to training stability

The scale γ is still learned; the shift β is dropped.

Where LayerNorm Appears in a Transformer Block

In a Pre-LN Transformer, each block looks like:

x → LN → MultiHeadAttention → + x → LN → FFN → + x
          ↑___________________↑       ↑__________↑
               residual                 residual

There are two layer norms per block: one before attention, one before the FFN. For a 96-layer model (GPT-3 scale), that is 192 LayerNorm operations per forward pass.

Comparison

Property	Post-LN	Pre-LN	RMSNorm
Gradient flow	Through LN	Via identity skip	Via identity skip
Training stability	Lower	Higher	Higher
Warmup required	Usually yes	Often no	Often no
Used by	BERT, original T5	GPT-2/3, LLaMA	LLaMA 2/3, Mistral
Mean-centering	Yes	Yes	No
Compute	Standard	Standard	~15% faster

Summary

Layer norm is not cosmetic. It controls how information flows and how gradients propagate through the network. The choice between Pre-LN and Post-LN explains many practical differences between model families — and Pre-LN’s superior stability is why it dominates modern large language model training.

Share on

Bluesky Facebook LinkedIn X (formerly Twitter)

Alessio Borgi

Layer Normalization in Transformers

Why Normalisation at All?

Layer Norm vs Batch Norm

The Layer Norm Formula

Post-LN: The Original Placement

Pre-LN: The Modern Standard

RMSNorm: A Simpler Variant

Where LayerNorm Appears in a Transformer Block

Comparison

Summary

Share on

You May Also Enjoy

Flamingo, BLIP, and the Rise of Vision-Language Models

CLIP: Connecting Images and Text with Contrastive Learning

MAE: Masked Autoencoders Are Scalable Vision Learners

DeiT: Training ViTs Efficiently Without Large Datasets