The Transformer Block: Putting It All Together

TL;DR: A Transformer block = LN → MHA → residual → LN → FFN → residual. This unit is stacked N times. Understanding one block means understanding the entire architecture. Every modern LLM is just this pattern repeated at scale.

The Block is the Atom

The Transformer is not a complex monolith. It is a simple building block, the Transformer block, stacked repeatedly. GPT-2 small stacks 12. GPT-3 stacks 96. LLaMA 3 (70B) stacks 80. But each block is identical in structure.

Understand one block; understand any Transformer.

Data Flow: Token Embeddings

Before the first block, each input token is converted to a vector via an embedding lookup and summed with a positional encoding:

"The"  → embedding[The]  + pos_enc[0]  → x₀ ∈ ℝ^d_model
"cat"  → embedding[cat]  + pos_enc[1]  → x₁ ∈ ℝ^d_model
"sat"  → embedding[sat]  + pos_enc[2]  → x₂ ∈ ℝ^d_model

These vectors form a matrix X ∈ ℝ^{seq_len × d_model}. This matrix flows through the stack of blocks.
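
The lookup-and-sum step above can be sketched in a few lines of NumPy. Everything specific here is an illustrative assumption: the three-word vocabulary, d_model = 8, the random (untrained) embedding table, and the choice of a sinusoidal positional encoding. Real models learn the embedding table and may use learned or rotary position schemes instead.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy vocabulary and model width (not taken from any real model).
vocab = {"The": 0, "cat": 1, "sat": 2}
d_model = 8

# Embedding table: one row per token id (learned in practice, random here).
embedding = rng.normal(size=(len(vocab), d_model))

def sinusoidal_pos_enc(seq_len, d_model):
    """Sinusoidal positional encoding, one common choice among several."""
    pos = np.arange(seq_len)[:, None]            # [seq_len, 1]
    i = np.arange(0, d_model, 2)[None, :]        # [1, d_model / 2]
    angles = pos / (10000.0 ** (i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                 # even feature indices
    pe[:, 1::2] = np.cos(angles)                 # odd feature indices
    return pe

tokens = ["The", "cat", "sat"]
token_ids = np.array([vocab[t] for t in tokens])

# Look up each token's row, then add the encoding for its position.
X = embedding[token_ids] + sinusoidal_pos_enc(len(tokens), d_model)
print(X.shape)  # (3, 8)
```

Each row of X is one token's vector, so X has shape [seq_len, d_model] as described.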

The Pre-LN Transformer Block (Modern Standard)

INPUT: x (shape: [seq_len, d_model])
│
├─── Identity copy ──────────────────────── ⊕ ←──┐
│                                                │
└──→ LayerNorm → MultiHeadAttention ─────────────┘
                                            ↓
x' (attended representation)
│
├─── Identity copy ──────────────────────── ⊕ ←──┐
│                                                │
└──→ LayerNorm → FeedForward (MLP) ──────────────┘
                                            ↓
OUTPUT: x'' (shape: [seq_len, d_model])

In equations:

x' = x + MHA( LayerNorm(x) )
x'' = x' + FFN( LayerNorm(x') )

Two additions. Two layer norms. One attention operation. One FFN. That is the entire block.

Step-by-Step Walkthrough

Step 1: Layer Norm (before attention)

The input x is normalised across its feature dimension. Each token's d_model-dimensional vector is shifted and scaled to zero mean and unit variance, then rescaled by a learned γ and shifted by a learned β.

This stabilises the distribution entering attention, preventing runaway growth of attention logits.
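
As a minimal sketch of this step, the function below normalises each token's vector over the feature dimension. The eps value, the toy shapes, and the identity initialisation of γ and β (ones and zeros) are assumptions made for illustration.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalise each row (token) of x over its feature dimension."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    x_hat = (x - mean) / np.sqrt(var + eps)   # zero mean, unit variance per token
    return gamma * x_hat + beta               # learned scale and shift

rng = np.random.default_rng(0)
# Deliberately badly scaled input: mean 3, standard deviation 10.
x = rng.normal(loc=3.0, scale=10.0, size=(4, 16))   # [seq_len, d_model]
gamma, beta = np.ones(16), np.zeros(16)

y = layer_norm(x, gamma, beta)
print(y.mean(axis=-1))  # close to 0 for every token
print(y.std(axis=-1))   # close to 1 for every token
```

Note that the statistics are computed per token, not across the sequence, so each position is normalised independently of the others.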

Step 2: Multi-Head Attention

The normalised input is projected into Q, K, V for each head:

  • Q, K used to compute attention weights (which tokens attend to which)
  • V used to compute the attended output (what information is retrieved)

Heads run in parallel; their outputs are concatenated and projected back to d_model.

Output shape: [seq_len, d_model], the same as the input.
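
The projection, per-head attention, and concatenation can be sketched as follows. This is a simplified illustration under several assumptions: no causal mask, no bias terms, d_v equal to d_k, and small random weights. The function and variable names are hypothetical.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, Wq, Wk, Wv, Wo, n_heads):
    """Unmasked self-attention. x: [L, d_model]; every W: [d_model, d_model]."""
    L, d_model = x.shape
    d_k = d_model // n_heads

    def split(t):
        # Split the feature dimension into heads: [L, d_model] -> [n_heads, L, d_k].
        return t.reshape(L, n_heads, d_k).transpose(1, 0, 2)

    Q, K, V = split(x @ Wq), split(x @ Wk), split(x @ Wv)

    # Attention weights (which tokens attend to which): [n_heads, L, L].
    weights = softmax(Q @ K.transpose(0, 2, 1) / np.sqrt(d_k))

    # Retrieved values per head, then concatenate heads back to [L, d_model].
    heads = weights @ V                                   # [n_heads, L, d_k]
    concat = heads.transpose(1, 0, 2).reshape(L, d_model)
    return concat @ Wo, weights

rng = np.random.default_rng(0)
L, d_model, n_heads = 5, 16, 4
x = rng.normal(size=(L, d_model))
Wq, Wk, Wv, Wo = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))

out, weights = multi_head_attention(x, Wq, Wk, Wv, Wo, n_heads)
print(out.shape)      # (5, 16)
print(weights.shape)  # (4, 5, 5)
```

A decoder-only model would additionally mask the upper triangle of the score matrix before the softmax so that each token attends only to earlier positions.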

Step 3: First Residual Addition

The attention output is added back to the original x (before normalisation). This is the residual connection:

x' = x + attention_output

The original signal is preserved. The attention result is an additive update to it, not a replacement.

Step 4: Layer Norm (before FFN)

The post-attention representation x' is normalised again, feeding into the FFN with a well-conditioned distribution.

Step 5: Feed-Forward Network

The FFN processes each token position independently:

  • Project up: d_model → 4 × d_model
  • Nonlinearity: GELU, or a gated variant such as SwiGLU
  • Project down: 4 × d_model → d_model

The FFN does not mix positions. It refines each token's representation in place.
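
Here is a minimal sketch of the expand, activate, contract pattern, using the tanh approximation of GELU. The toy sizes, random weights, and zero biases are assumptions for illustration. The last few lines check the claim that positions are not mixed: perturbing one token leaves every other row of the output unchanged.

```python
import numpy as np

def gelu(x):
    """Tanh approximation of GELU."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def ffn(x, W1, b1, W2, b2):
    """Position-wise MLP: expand to d_ff, apply GELU, contract to d_model."""
    return gelu(x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
L, d_model = 5, 16
d_ff = 4 * d_model
W1, b1 = rng.normal(size=(d_model, d_ff)) * 0.1, np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)) * 0.1, np.zeros(d_model)

x = rng.normal(size=(L, d_model))
y = ffn(x, W1, b1, W2, b2)
print(y.shape)  # (5, 16)

# Positions are independent: changing token 0 leaves all other rows unchanged.
x_perturbed = x.copy()
x_perturbed[0] += 1.0
y_perturbed = ffn(x_perturbed, W1, b1, W2, b2)
print(np.allclose(y[1:], y_perturbed[1:]))  # True
```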

Step 6: Second Residual Addition

x'' = x' + ffn_output

The FFN's contribution is added to x'. Again, the skip connection preserves the signal.

x'' is the output of the block and becomes the input to the next block.
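
The six steps can be combined into one short function. The sketch below is one possible NumPy rendering under stated assumptions: unmasked attention, no biases in the linear layers, GELU in the FFN, d_ff = 4 × d_model, and small random weights. The parameter names (Wq, W1, g1, and so on) are hypothetical.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def gelu(x):
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def mha(x, p, n_heads):
    """Unmasked multi-head self-attention (Step 2)."""
    L, d_model = x.shape
    d_k = d_model // n_heads
    def split(t):  # [L, d_model] -> [n_heads, L, d_k]
        return t.reshape(L, n_heads, d_k).transpose(1, 0, 2)
    Q, K, V = split(x @ p["Wq"]), split(x @ p["Wk"]), split(x @ p["Wv"])
    weights = softmax(Q @ K.transpose(0, 2, 1) / np.sqrt(d_k))
    concat = (weights @ V).transpose(1, 0, 2).reshape(L, d_model)
    return concat @ p["Wo"]

def ffn(x, p):
    """Position-wise feed-forward network (Step 5)."""
    return gelu(x @ p["W1"]) @ p["W2"]

def transformer_block(x, p, n_heads):
    x = x + mha(layer_norm(x, p["g1"], p["b1"]), p, n_heads)  # Steps 1-3
    x = x + ffn(layer_norm(x, p["g2"], p["b2"]), p)           # Steps 4-6
    return x

def init_params(rng, d_model, scale=0.02):
    d_ff = 4 * d_model
    shapes = {"Wq": (d_model, d_model), "Wk": (d_model, d_model),
              "Wv": (d_model, d_model), "Wo": (d_model, d_model),
              "W1": (d_model, d_ff), "W2": (d_ff, d_model)}
    p = {name: rng.normal(size=shape) * scale for name, shape in shapes.items()}
    p.update(g1=np.ones(d_model), b1=np.zeros(d_model),
             g2=np.ones(d_model), b2=np.zeros(d_model))
    return p

rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 6, 32, 4
params = init_params(rng, d_model)
x = rng.normal(size=(seq_len, d_model))

out = transformer_block(x, params, n_heads)
print(out.shape)  # (6, 32)

# If both output projections are zero, each sublayer contributes nothing and
# the residual path passes the input through unchanged.
bypass = dict(params, Wo=np.zeros((d_model, d_model)),
              W2=np.zeros((4 * d_model, d_model)))
print(np.allclose(transformer_block(x, bypass, n_heads), x))  # True
```

The bypass check at the end illustrates why the residual form is convenient: a sublayer that outputs zero turns its half of the block into the identity.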

Shapes Throughout One Block

Stage                          Tensor shape
Input x                        [L, d_model]
After LN (pre-attention)       [L, d_model]
Q, K per head                  [L, d_k] each
V per head                     [L, d_v] each
Attention output (per head)    [L, d_v]
After concat + project         [L, d_model]
After residual                 [L, d_model]
After LN (pre-FFN)             [L, d_model]
After W₁ (FFN expand)          [L, 4·d_model]
After W₂ (FFN contract)        [L, d_model]
After residual (output)        [L, d_model]

The shape is always [L, d_model] entering and leaving the block, where L is the sequence length seq_len. Stacking blocks does not change the shape, only the content.

What Each Component Contributes

Component               Role
Layer Norm              Stabilises distributions; enables deep stacking
Multi-Head Attention    Mixes information across positions
First Residual          Preserves input; enables gradient highway
Feed-Forward Network    Refines per-position; stores knowledge
Second Residual         Preserves input; enables gradient highway

The separation of concerns: Attention handles where to look (cross-position). FFN handles what to do at each position (per-token). Residuals ensure both can be bypassed if needed. Layer norms keep everything stable. Each component is minimal, independent, and essential.

The Stack

A full Transformer is this block, repeated N times, followed by a final layer norm and an output head:

Token + Positional Embeddings
         ↓
   [Block 1]  ← LN → MHA → + → LN → FFN → +
         ↓
   [Block 2]
         ↓
      ...
         ↓
   [Block N]
         ↓
   Final LayerNorm
         ↓
   Output projection (lm_head): d_model → vocab_size
         ↓
   Logits → softmax → token probabilities
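
The pipeline in the diagram can be sketched end to end. To keep the example short, each block here is a stand-in: a simple residual per-token map that only mimics the [seq_len, d_model] → [seq_len, d_model] contract of a real block. That substitution, the learned-style positional table, the toy sizes, and the LayerNorm without γ and β are all assumptions for illustration.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # LayerNorm without learned gamma/beta, to keep the sketch short.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def make_stand_in_block(rng, d_model):
    """Placeholder for a real Transformer block: a residual per-token map.
    It only honours the [L, d_model] -> [L, d_model] contract."""
    W = rng.normal(size=(d_model, d_model)) * 0.02
    return lambda x: x + np.tanh(layer_norm(x) @ W)

rng = np.random.default_rng(0)
vocab_size, d_model, n_blocks, seq_len = 50, 16, 4, 6

tok_emb = rng.normal(size=(vocab_size, d_model)) * 0.02
pos_emb = rng.normal(size=(seq_len, d_model)) * 0.02
lm_head = rng.normal(size=(d_model, vocab_size)) * 0.02
blocks = [make_stand_in_block(rng, d_model) for _ in range(n_blocks)]

token_ids = rng.integers(0, vocab_size, size=seq_len)

x = tok_emb[token_ids] + pos_emb   # token + positional embeddings
for block in blocks:               # Block 1 ... Block N
    x = block(x)                   # shape stays [seq_len, d_model]
x = layer_norm(x)                  # final LayerNorm
logits = x @ lm_head               # [seq_len, vocab_size]
probs = softmax(logits)            # one distribution over the vocabulary per position

print(logits.shape)        # (6, 50)
print(probs.sum(axis=-1))  # each entry is 1 (up to floating-point error)
```

Swapping the stand-in for a full LN → MHA → residual → LN → FFN → residual block changes nothing else in the loop, which is the point of the uniform block shape.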

GPT-3 at 175B parameters is 96 of these blocks, each with d_model=12288, 96 attention heads, and d_ff=49152. The architecture is the same as described here. The only differences are scale and a few engineering choices (RoPE, SwiGLU, grouped-query attention in modern models).
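
As a rough sanity check on those numbers, the weight matrices alone account for almost all of the 175B parameters. The arithmetic below ignores biases and LayerNorm parameters and assumes a vocabulary of 50,257 tokens with the output projection sharing the embedding matrix. Both are assumptions, not figures stated above.

```python
# Rough parameter count for a GPT-3-sized stack, ignoring biases and
# LayerNorm parameters (a small fraction of the total).
n_blocks = 96
d_model = 12288
d_ff = 49152          # 4 * d_model
vocab_size = 50257    # assumption: GPT-2/GPT-3 BPE vocabulary size

attn_params = 4 * d_model * d_model       # W_Q, W_K, W_V, W_O
ffn_params = 2 * d_model * d_ff           # W_1 and W_2
block_params = attn_params + ffn_params   # equals 12 * d_model**2

# Token embeddings; the output projection is assumed to reuse this matrix.
embedding_params = vocab_size * d_model
total = n_blocks * block_params + embedding_params

print(f"per block: {block_params / 1e9:.2f}B")  # per block: 1.81B
print(f"total:     {total / 1e9:.1f}B")         # total:     174.6B
```

About two thirds of each block's weights sit in the FFN (8 · d_model² versus 4 · d_model² for attention).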

Summary

The Transformer block is:

  1. LN + MHA + Residual: cross-position information gathering
  2. LN + FFN + Residual: per-position processing and knowledge retrieval

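What distinguishes one Transformer from another (BERT, GPT, T5, ViT, LLaMA) is a combination of how these blocks are arranged, what masking strategy is used, and what input/output heads are attached. The block itself follows the same pattern throughout, with minor variations such as where the layer norm sits or an added cross-attention sublayer in encoder-decoder models.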