Feed-Forward Networks: The Forgotten Half of Transformers
The FFN Is Half the Block
Every Transformer block follows this pattern:
x → MultiHeadAttention → residual + LN → FeedForward → residual + LN → output
The FeedForward (FFN) sub-layer is the second half of every block. In popular Transformer explanations, it is often described in one sentence and then forgotten in favour of attention. This is a mistake: the FFN holds most of each block's weights and does all of its per-token processing.
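The block pattern above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: `layer_norm` has no learned scale or shift, and `toy_attention` and `toy_ffn` are hypothetical stand-ins for the real sub-layers, chosen only so the code runs.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalise each token vector to zero mean and unit variance (no learned scale/shift)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def transformer_block(x, attention, feed_forward):
    """Post-LN block: each sub-layer gets a residual connection, then layer norm."""
    x = layer_norm(x + attention(x))      # first half: mixes information across positions
    x = layer_norm(x + feed_forward(x))   # second half: processes each position on its own
    return x

# Hypothetical stand-ins so the sketch runs; real sub-layers have learned weights.
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))               # 5 tokens, d_model = 8

def toy_attention(h):
    # Every position receives the average of all positions.
    return np.tile(h.mean(axis=0), (h.shape[0], 1))

def toy_ffn(h):
    # Element-wise, so each position is handled independently.
    return np.maximum(0.0, h)

out = transformer_block(x, toy_attention, toy_ffn)
print(out.shape)  # (5, 8)
```

The two halves have the same wrapper (residual, then LN); only the function inside differs.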
The Architecture of the FFN
The FFN is a simple two-layer MLP applied position-wise: each token is processed identically and independently. With bias terms omitted, FFN(x) = activation(x W₁) W₂, where:
- W₁ ∈ ℝ^{d_model × d_ff}: projects up from d_model to d_ff
- activation: nonlinearity (ReLU, GELU, or SwiGLU)
- W₂ ∈ ℝ^{d_ff × d_model}: projects back down
- d_ff = 4 × d_model in most models (e.g., 512 → 2048, or 4096 → 16384)
The 4× expansion and contraction is standard but not derived from first principles. The original Transformer paper used it (d_model = 512, d_ff = 2048), and it has remained the default ever since.
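As a rough sketch of the architecture just described, here is the FFN forward pass in NumPy with a ReLU activation. The toy sizes, the random weights, and the bias terms `b1` and `b2` are assumptions for illustration (the original Transformer includes biases, some later models drop them).

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    """Position-wise FFN. x has shape (seq_len, d_model)."""
    hidden = np.maximum(0.0, x @ W1 + b1)   # project up to d_ff, then ReLU
    return hidden @ W2 + b2                 # project back down to d_model

d_model, d_ff, seq_len = 8, 32, 5           # toy sizes, with d_ff = 4 * d_model
rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.1, size=(d_model, d_ff))
b1 = np.zeros(d_ff)
W2 = rng.normal(scale=0.1, size=(d_ff, d_model))
b2 = np.zeros(d_model)

x = rng.normal(size=(seq_len, d_model))
y = ffn(x, W1, b1, W2, b2)
print(y.shape)  # (5, 8)
```

The same weights are applied to every row of `x`, which is what "position-wise" means in practice.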
Parameter Count: FFN Dominates
For a model with d_model = 1024 and d_ff = 4096, in each block:
| Sub-layer | Parameters |
|---|---|
| Multi-head attention (4 matrices: Q, K, V, output) | 4 × 1024² ≈ 4.2M |
| FFN (2 matrices) | 2 × 1024 × 4096 ≈ 8.4M |
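The FFN holds twice as many weight-matrix parameters as the attention sub-layer. Because this 2:1 ratio holds in every block regardless of depth, FFNs collectively account for roughly 2/3 of the per-block parameters in a deep model (for example, 96 layers); embeddings, biases, and layer-norm parameters are not counted here.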
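The arithmetic behind the table can be checked directly. The helper name `block_params` is made up for this sketch, and it counts weight matrices only (no biases, layer norms, or embeddings).

```python
def block_params(d_model, d_ff):
    """Weight-matrix parameter counts for one block (biases and LN ignored)."""
    attn = 4 * d_model * d_model   # Q, K, V and output projections
    ffn = 2 * d_model * d_ff       # up-projection and down-projection
    return attn, ffn

attn, ffn_count = block_params(1024, 4096)
share = ffn_count / (attn + ffn_count)
print(attn, ffn_count, round(share, 3))  # 4194304 8388608 0.667
```

With d_ff = 4 × d_model the share is exactly 8/12 = 2/3 for any d_model, which is why the figure does not depend on model width or depth.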
What Does the FFN Actually Do?
Attention vs FFN: Division of Labour
Research into Transformer internals has revealed a rough division:
- Attention heads move information between positions — they determine which tokens influence each other and gather context
- FFN layers process information at a single position — they apply transformations and recall facts
This is why a model can “know” that Paris is the capital of France even when the current context never states it: there is nothing for attention to copy the fact from, so it has to be recalled from the weights, and the FFN is where that recall is thought to happen.
FFN as a Key-Value Memory
A 2020 paper (Geva et al., “Transformer Feed-Forward Layers Are Key-Value Memories”) showed that the FFN can be interpreted as:
- W₁ columns (the “keys”): pattern detectors. With W₁ ∈ ℝ^{d_model × d_ff}, each of the d_ff hidden neurons owns one column, and it activates when the input matches that column's pattern
- W₂ rows (the “values”): for each activated key, the corresponding row of W₂ is scaled by the activation and added to the output
When a token’s representation matches a learned pattern, the corresponding key neuron activates, and its value vector is added to the output in proportion to the activation. This is analogous to a soft content-addressable memory: the FFN stores and retrieves (input pattern, output update) associations, and factual recall is one instance of this.
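The key-value reading is just a different way of writing the same matrix product, which a short NumPy check can illustrate. The sketch assumes a ReLU FFN without biases and the shapes given earlier (W₁ ∈ ℝ^{d_model × d_ff}, tokens as row vectors), so key i is `W1[:, i]` and value i is `W2[i, :]`; in the transposed convention the roles of rows and columns swap.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff = 8, 32
W1 = rng.normal(size=(d_model, d_ff))
W2 = rng.normal(size=(d_ff, d_model))
x = rng.normal(size=d_model)               # one token vector

# Standard matrix form (ReLU, no biases).
coeffs = np.maximum(0.0, x @ W1)           # how strongly each key matches x
dense_out = coeffs @ W2

# Memory form: sum of value vectors, each weighted by its key's activation.
memory_out = np.zeros(d_model)
for i in range(d_ff):
    key_i = W1[:, i]
    value_i = W2[i, :]
    memory_out += max(0.0, float(x @ key_i)) * value_i

print(np.allclose(dense_out, memory_out))  # True
```

Keys whose activation is zero contribute nothing, so only the matched "memories" shape the output.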
The Nonlinearity: ReLU, GELU, SwiGLU
ReLU (original Transformer, 2017)
ReLU(z) = max(0, z). Simple and sparse: negative pre-activations are mapped to exactly zero, so for any given token many hidden units are switched off and only a subset contributes to the output.
GELU (GPT-2, BERT, and successors)
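GELU(z) = z · Φ(z), where Φ is the standard normal CDF. It is a smooth relative of ReLU with a non-zero gradient for negative inputs, and it is often reported to outperform ReLU on language tasks, though the margin depends on the setup.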
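To make the difference between the two activations concrete, the sketch below evaluates both on a few sample inputs. It uses the exact erf-based form of GELU; many implementations use a tanh approximation instead, which gives slightly different values.

```python
import math
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def gelu(z):
    """Exact GELU: z * Phi(z), with Phi the standard normal CDF."""
    erf = np.vectorize(math.erf)
    return 0.5 * z * (1.0 + erf(z / math.sqrt(2.0)))

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(z))               # negatives become exactly zero
print(np.round(gelu(z), 4))  # negatives become small negative values
```

For large positive inputs the two functions nearly coincide; they differ mainly around zero and for moderately negative inputs.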
SwiGLU (LLaMA, PaLM, Mistral)
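A gated variant: two parallel linear projections, one passed through Swish (SiLU) and gating the other element-wise, followed by the usual down-projection. Because this gives the FFN three weight matrices instead of two, SwiGLU-based FFNs often use d_ff ≈ (8/3) × d_model (not 4×) to keep the parameter count comparable: 3 × (8/3) = 2 × 4. SwiGLU has been reported to outperform ReLU and GELU FFNs at matched parameter counts.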
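A minimal sketch of a SwiGLU FFN, together with the parameter arithmetic behind the 8/3 factor, might look like this. The weight names `W_gate`, `W_up`, and `W_down` are labels chosen for this example, biases are omitted, and d_model = 768 is picked only because it divides evenly by 3.

```python
import numpy as np

def silu(z):
    """Swish / SiLU activation: z * sigmoid(z)."""
    return z / (1.0 + np.exp(-z))

def swiglu_ffn(x, W_gate, W_up, W_down):
    """Gated FFN: the SiLU branch gates the linear branch element-wise."""
    return (silu(x @ W_gate) * (x @ W_up)) @ W_down

d_model = 768
d_ff_dense = 4 * d_model                   # 3072, for a ReLU/GELU FFN
d_ff_gated = (8 * d_model) // 3            # 2048, for a SwiGLU FFN

dense_params = 2 * d_model * d_ff_dense    # two weight matrices
gated_params = 3 * d_model * d_ff_gated    # three weight matrices
print(dense_params, gated_params)          # 4718592 4718592

rng = np.random.default_rng(0)
x = rng.normal(size=(5, d_model))
W_gate = rng.normal(scale=0.02, size=(d_model, d_ff_gated))
W_up = rng.normal(scale=0.02, size=(d_model, d_ff_gated))
W_down = rng.normal(scale=0.02, size=(d_ff_gated, d_model))
y = swiglu_ffn(x, W_gate, W_up, W_down)
print(y.shape)  # (5, 768)
```

In practice the hidden size is usually rounded to a hardware-friendly multiple, so real models match the dense parameter count only approximately.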
Position-Wise Independence: A Key Property
The FFN processes each token independently — it does not look at neighbouring tokens. There is no attention-like mechanism: the computation for position i uses only the vector at position i.
This means:
- Parallelisable across positions (all tokens in a sequence processed simultaneously)
- No position-to-position information mixing — that is strictly the role of attention
- The FFN refines each token’s representation in place; it does not redistribute information
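The independence property is easy to check numerically: change one token and only that position's output moves. The sketch below assumes a toy ReLU FFN with random weights and no biases.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, seq_len = 8, 32, 6
W1 = rng.normal(size=(d_model, d_ff))
W2 = rng.normal(size=(d_ff, d_model))

def ffn(x):
    """ReLU FFN applied to every row (token) of x."""
    return np.maximum(0.0, x @ W1) @ W2

x = rng.normal(size=(seq_len, d_model))
y = ffn(x)

# Replace only the token at position 2 and recompute.
x_edit = x.copy()
x_edit[2] = rng.normal(size=d_model)
y_edit = ffn(x_edit)

# A position counts as changed if any of its output features differ.
changed = ~np.all(np.isclose(y, y_edit), axis=-1)
print(np.flatnonzero(changed))  # [2]
```

Running the same experiment through an attention layer would change the outputs at other positions too, which is the division of labour described above.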
Sparse FFNs: MoE
Mixture-of-Experts (MoE) Transformers replace the dense FFN with multiple expert FFNs, routing each token to only a small subset of them (for example, 2 of 8 experts in Mixtral 8×7B; other designs use dozens of experts or more):
token → router → top-k experts → weighted sum → output
This allows vastly more total parameters (stored in expert FFNs) while keeping per-token computation roughly constant, because only a fraction of the experts run for any given token. Mixtral 8×7B uses MoE in the FFN sub-layer, and GPT-4 reportedly does as well.
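A toy version of the routing step can be written in NumPy as follows. All sizes and weights here are illustrative assumptions, and normalising the gate with a softmax over only the selected experts is one common choice, not the only one; real systems also add load balancing and batch tokens per expert.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, n_experts, top_k = 8, 16, 4, 2

# Each expert is its own small ReLU FFN (hypothetical toy weights, no biases).
experts = [
    (rng.normal(scale=0.1, size=(d_model, d_ff)),
     rng.normal(scale=0.1, size=(d_ff, d_model)))
    for _ in range(n_experts)
]
W_router = rng.normal(scale=0.1, size=(d_model, n_experts))

def moe_ffn(x):
    """Route each token to its top_k experts and mix their outputs."""
    out = np.zeros_like(x)
    chosen = []
    logits = x @ W_router                        # (seq_len, n_experts)
    for t in range(x.shape[0]):
        top = np.argsort(logits[t])[-top_k:]     # indices of the top_k experts
        gate = np.exp(logits[t, top] - logits[t, top].max())
        gate /= gate.sum()                       # softmax over the selected experts
        for g, e in zip(gate, top):
            W1, W2 = experts[e]
            out[t] += g * (np.maximum(0.0, x[t] @ W1) @ W2)
        chosen.append(sorted(top.tolist()))
    return out, chosen

x = rng.normal(size=(5, d_model))
y, chosen = moe_ffn(x)
print(y.shape)   # (5, 8)
print(chosen)    # which experts each token used
```

Only `top_k` of the `n_experts` FFNs run per token, so compute scales with `top_k` while the stored parameters scale with `n_experts`.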
Summary
| Property | Value |
|---|---|
| Architecture | Two-layer MLP with expansion |
| Expansion factor | 4× (ReLU/GELU) or 8/3× (SwiGLU) |
| Applied to | Each token independently |
| Parameter share | ~2/3 of per-block weights in standard models (embeddings excluded) |
| Information role | Per-position processing and fact retrieval |
| Attention role comparison | Attention mixes positions; FFN refines each position |
The FFN is not attention’s sidekick. It is an equal partner — the knowledge storage and processing unit that sits beside attention’s information-routing mechanism.
