The Transformer Block: Putting It All Together
The Block is the Atom
The Transformer is not a complex monolith. It is a simple building block, the Transformer block, stacked repeatedly. GPT-2 small stacks 12. GPT-3 stacks 96. LLaMA 3 (70B) stacks 80. But each block is identical in structure.
Understand one block; understand any Transformer.
Data Flow: Token Embeddings
Before the first block, each input token is converted to a vector via an embedding lookup and summed with a positional encoding:
"The" → embedding[The] + pos_enc[0] → x₀ ∈ ℝ^d_model
"cat" → embedding[cat] + pos_enc[1] → x₁ ∈ ℝ^d_model
"sat" → embedding[sat] + pos_enc[2] → x₂ ∈ ℝ^d_model
These vectors form a matrix X ∈ ℝ^{seq_len × d_model}. This matrix flows through the stack of blocks.
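The lookup-and-sum step can be sketched in a few lines of NumPy. This is a minimal illustration, not any particular model's code: the toy vocabulary, the sizes, and the names `embed`, `embedding`, and `pos_enc` are assumptions, and both tables are filled with random numbers where a real model would use learned (or sinusoidal) values.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = {"The": 0, "cat": 1, "sat": 2}  # toy vocabulary (assumed)
d_model, max_len = 8, 16                # toy sizes (assumed)

# Random stand-ins for tables that a real model would learn.
embedding = rng.normal(size=(len(vocab), d_model))
pos_enc = rng.normal(size=(max_len, d_model))

def embed(tokens):
    """Return X with shape [seq_len, d_model]: token embedding + positional encoding."""
    ids = np.array([vocab[t] for t in tokens])
    positions = np.arange(len(tokens))
    return embedding[ids] + pos_enc[positions]

X = embed(["The", "cat", "sat"])
print(X.shape)  # (3, 8)
```

Each row of `X` is one token's vector, so row 1 is exactly `embedding[1] + pos_enc[1]`.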
The Pre-LN Transformer Block (Modern Standard)
INPUT: x (shape: [seq_len, d_model])
│
├─── Identity copy ──────────────┐
│                                │
LayerNorm → MultiHeadAttention   │
│                                │
(+) ◄────────────────────────────┘
│
x' (attended representation)
│
├─── Identity copy ──────────────┐
│                                │
LayerNorm → FeedForward (MLP)    │
│                                │
(+) ◄────────────────────────────┘
│
OUTPUT: x'' (shape: [seq_len, d_model])
In equations:
x' = x + MultiHeadAttention( LayerNorm(x) )
x'' = x' + FFN( LayerNorm(x') )
Two additions. Two layer norms. One attention operation. One FFN. That is the entire block.
Step-by-Step Walkthrough
Step 1: Layer Norm (before attention)
The input x is normalised across its feature dimension. Each token's d_model-dimensional vector is normalised to zero mean and unit variance, then scaled by a learned γ and shifted by a learned β.
This stabilises the distribution entering attention, which helps keep the attention logits from growing without bound.
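A minimal NumPy sketch of this normalisation follows. The function name `layer_norm` and the `eps` value are assumptions for illustration; γ and β are passed in as plain arrays rather than learned.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalise each row (token) of x over the feature dimension."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    x_hat = (x - mean) / np.sqrt(var + eps)  # zero mean, unit variance per token
    return gamma * x_hat + beta              # learned scale and shift

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=10.0, size=(4, 16))  # 4 tokens, d_model = 16
y = layer_norm(x, gamma=np.ones(16), beta=np.zeros(16))

print(y.mean(axis=-1))  # every entry close to 0
print(y.std(axis=-1))   # every entry close to 1
```

Note that the statistics are computed per token, over features, so the result does not depend on the other tokens in the sequence.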
Step 2: Multi-Head Attention
The normalised input is projected into Q, K, V for each head:
- Q, K used to compute attention weights (which tokens attend to which)
- V used to compute the attended output (what information is retrieved)
Heads run in parallel; their outputs are concatenated and projected back to d_model.
Output shape: [seq_len, d_model], the same as the input.
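The projection, per-head attention, and concatenation can be sketched as below. This is an illustrative simplification under stated assumptions: no causal mask, no biases, d_k = d_v = d_model / n_heads, and random weights. The name `multi_head_attention` and its signature are hypothetical.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, w_q, w_k, w_v, w_o, n_heads):
    """x: [L, d_model]; every weight matrix: [d_model, d_model]."""
    L, d_model = x.shape
    d_k = d_model // n_heads

    def split(t):
        # [L, d_model] -> [n_heads, L, d_k]
        return t.reshape(L, n_heads, d_k).transpose(1, 0, 2)

    q, k, v = split(x @ w_q), split(x @ w_k), split(x @ w_v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_k)  # [n_heads, L, L]
    weights = softmax(scores, axis=-1)                 # which tokens attend to which
    heads = weights @ v                                # [n_heads, L, d_k]
    concat = heads.transpose(1, 0, 2).reshape(L, d_model)
    return concat @ w_o, weights                       # project back to d_model

rng = np.random.default_rng(0)
L, d_model, n_heads = 5, 16, 4
x = rng.normal(size=(L, d_model))
w_q, w_k, w_v, w_o = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))

out, weights = multi_head_attention(x, w_q, w_k, w_v, w_o, n_heads)
print(out.shape)      # (5, 16)
print(weights.shape)  # (4, 5, 5)
```

Each row of `weights[h]` sums to 1: it is the distribution over positions that one token attends to in head `h`.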
Step 3: First Residual Addition
The attention output is added back to the original x (before normalisation). This is the residual connection:
x' = x + attention_output
The original signal is preserved; the attention output acts as an additive update to it.
Step 4: Layer Norm (before FFN)
The post-attention representation x' is normalised again, so the FFN receives a well-conditioned distribution.
Step 5: Feed-Forward Network
The FFN processes each token position independently:
- Project up: d_model → 4 × d_model
- Nonlinearity: GELU or SwiGLU
- Project down: 4 × d_model → d_model
The FFN does not mix positions; it refines each token's representation in place.
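The three steps above, and the claim that positions are not mixed, can be checked with a small sketch. It assumes the GELU variant (tanh approximation) and random weights; the names `ffn`, `w1`, `b1`, `w2`, and `b2` are chosen for illustration.

```python
import numpy as np

def gelu(z):
    # tanh approximation of GELU
    return 0.5 * z * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (z + 0.044715 * z**3)))

def ffn(x, w1, b1, w2, b2):
    h = gelu(x @ w1 + b1)  # project up:   [L, d_model] -> [L, 4*d_model]
    return h @ w2 + b2     # project down: [L, 4*d_model] -> [L, d_model]

rng = np.random.default_rng(0)
d_model = 8
w1 = rng.normal(size=(d_model, 4 * d_model)) * 0.1
b1 = np.zeros(4 * d_model)
w2 = rng.normal(size=(4 * d_model, d_model)) * 0.1
b2 = np.zeros(d_model)

x = rng.normal(size=(3, d_model))          # 3 tokens
full = ffn(x, w1, b1, w2, b2)              # whole sequence at once
single = ffn(x[1:2], w1, b1, w2, b2)       # token 1 on its own

print(full.shape)                      # (3, 8)
print(np.allclose(full[1], single[0])) # True
```

Running token 1 alone gives the same result as running it inside the sequence, which is what "independent per position" means in practice.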
Step 6: Second Residual Addition
x'' = x' + ffn_output
The FFN's contribution is added to x'. Again, the skip connection preserves the signal.
x'' is the output of the block and becomes the input to the next block.
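Putting the six steps together, one Pre-LN block might look like the sketch below. To keep it short it makes several simplifying assumptions that are not from the text: a single attention head, no mask, no biases, and layer norm with γ = 1 and β = 0. All function and parameter names (`pre_ln_block`, `init_params`, `w_q`, and so on) are hypothetical.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Learned gamma and beta are omitted (fixed at 1 and 0) to keep the sketch short.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention(x, p):
    # Single head, no mask: a simplification of multi-head attention.
    q, k, v = x @ p["w_q"], x @ p["w_k"], x @ p["w_v"]
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return (weights @ v) @ p["w_o"]

def gelu(z):
    return 0.5 * z * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (z + 0.044715 * z**3)))

def ffn(x, p):
    return gelu(x @ p["w_1"]) @ p["w_2"]

def pre_ln_block(x, p):
    x_prime = x + attention(layer_norm(x), p)     # steps 1-3
    return x_prime + ffn(layer_norm(x_prime), p)  # steps 4-6

def init_params(rng, d_model):
    shapes = {
        "w_q": (d_model, d_model), "w_k": (d_model, d_model),
        "w_v": (d_model, d_model), "w_o": (d_model, d_model),
        "w_1": (d_model, 4 * d_model), "w_2": (4 * d_model, d_model),
    }
    return {name: rng.normal(size=shape) * 0.1 for name, shape in shapes.items()}

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 16))
params = init_params(rng, d_model=16)

out = pre_ln_block(x, params)
print(out.shape)  # (5, 16)

# With both output projections zeroed, each sub-layer adds nothing
# and the residual path carries x through unchanged.
zeroed = dict(params, w_o=np.zeros((16, 16)), w_2=np.zeros((64, 16)))
print(np.allclose(pre_ln_block(x, zeroed), x))  # True
```

The second check illustrates why the skip connections matter: the block can only add to its input, so the default behaviour is to pass the signal through.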
Shapes Throughout One Block
| Stage | Tensor shape |
|---|---|
| Input x | [L, d_model] |
| After LN (pre-attention) | [L, d_model] |
| Q, K per head | [L, d_k] each |
| V per head | [L, d_v] each |
| Attention output (per head) | [L, d_v] |
| After concat + project | [L, d_model] |
| After residual | [L, d_model] |
| After LN (pre-FFN) | [L, d_model] |
| After W₁ (FFN expand) | [L, 4·d_model] |
| After W₂ (FFN contract) | [L, d_model] |
| After residual (output) | [L, d_model] |
The shape is always [L, d_model] entering and leaving the block. Stacking blocks does not change the shape, only the content.
What Each Component Contributes
| Component | Role |
|---|---|
| Layer Norm | Stabilises distributions; enables deep stacking |
| Multi-Head Attention | Mixes information across positions |
| First Residual | Preserves input; enables gradient highway |
| Feed-Forward Network | Refines per-position; stores knowledge |
| Second Residual | Preserves input; enables gradient highway |
The Stack
A full Transformer is this block, repeated N times, followed by a final layer norm and an output head:
Token + Positional Embeddings
↓
[Block 1]: LN → MHA → + → LN → FFN → +
↓
[Block 2]
↓
...
↓
[Block N]
↓
Final LayerNorm
↓
Output projection (lm_head): d_model → vocab_size
↓
Logits → softmax → token probabilities
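The outer loop of the stack can be sketched as follows. To keep the focus on the stacking, the final layer norm, and the output head, each block here is a hypothetical stand-in (a single shape-preserving residual update), not a real attention + FFN block. Sizes and names such as `forward`, `make_stand_in_block`, and `w_lm` are assumptions.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def make_stand_in_block(rng, d_model):
    # Hypothetical placeholder for a real block: one residual update
    # that maps [L, d_model] -> [L, d_model].
    w = rng.normal(size=(d_model, d_model)) * 0.1
    return lambda x: x + np.tanh(layer_norm(x) @ w)

def forward(x, blocks, w_lm):
    for block in blocks:        # Block 1 ... Block N
        x = block(x)
    x = layer_norm(x)           # final LayerNorm
    logits = x @ w_lm           # lm_head: d_model -> vocab_size
    return softmax(logits)      # token probabilities per position

rng = np.random.default_rng(0)
seq_len, d_model, vocab_size, n_blocks = 3, 8, 11, 4
blocks = [make_stand_in_block(rng, d_model) for _ in range(n_blocks)]
w_lm = rng.normal(size=(d_model, vocab_size)) * 0.1

x = rng.normal(size=(seq_len, d_model))  # stands in for token + positional embeddings
probs = forward(x, blocks, w_lm)
print(probs.shape)  # (3, 11)
```

Because every block preserves the [L, d_model] shape, `n_blocks` can be changed freely without touching the rest of the code.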
GPT-3 at 175B parameters is 96 of these blocks, each with d_model=12288, 96 attention heads, and d_ff=49152. The architecture is the same as described here. More recent models differ mainly in scale and a few engineering choices (RoPE, SwiGLU, grouped-query attention).
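As a rough sanity check on the 175B figure, the weight matrices in the blocks can be counted directly. This back-of-the-envelope sketch counts only the four attention projections and the two FFN matrices; it ignores biases, layer norm parameters, and the embedding tables, which is an assumption made for simplicity.

```python
def block_params(d_model, d_ff):
    attn = 4 * d_model * d_model  # W_Q, W_K, W_V, W_O
    ffn = 2 * d_model * d_ff      # W_1 (expand) and W_2 (contract)
    return attn + ffn

d_model, d_ff, n_blocks = 12288, 49152, 96

per_block = block_params(d_model, d_ff)
total = n_blocks * per_block

print(per_block)  # 1811939328
print(total)      # 173946175488
```

With d_ff = 4 × d_model this reduces to 12 × d_model² per block, about 1.8B, and 96 blocks give roughly 174B. The remainder of the 175B comes from the parts left out of the count.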
Summary
The Transformer block is:
- LN + MHA + Residual → cross-position information gathering
- LN + FFN + Residual → per-position processing and knowledge retrieval
Everything else in the Transformer family (BERT, GPT, T5, ViT, LLaMA) is a combination of how these blocks are arranged, what masking strategy is used, and what input/output heads are attached. The block itself keeps the same structure.
