Multi-Head Attention: Many Eyes on the Data

2 minute read

Published:

TL;DR: Multi-Head Attention runs several self-attention operations in parallel, each in a smaller subspace. Each "head" independently learns what to attend to, capturing different aspects of the input. Their outputs are concatenated and projected back.

Why One Head Isn’t Enough

A single self-attention head computes one set of relevance scores across all token pairs. But language is rich: in one sentence, “bank” might need to attend to “river” to resolve its meaning and to “withdrew” to pick up its syntactic role, both at the same time.

With a single head, each token gets only one attention distribution, so competing signals are blended together and specificity is lost. Multiple heads give each token several distributions, and each head can specialise in a different type of relationship.

The Idea: Parallel Subspaces

Instead of computing attention once in the full d-dimensional space, Multi-Head Attention:

  1. Projects Q, K, and V with learned weight matrices into h smaller pieces (each of dimension d/h).
  2. Runs scaled dot-product attention independently on each piece (each piece = one “head”).
  3. Concatenates the h output matrices.
  4. Projects the concatenated result back to the original dimension with a final weight matrix W_O.
[Figure: Input X feeds h parallel branches, each a linear projection (W₁, W₂, …, Wₕ) followed by an attention head (Head 1 focus: syntax; Head 2 focus: semantics; Head h focus: co-reference). The head outputs are concatenated (head₁ ‖ head₂ ‖ … ‖ headₕ) and passed through the linear projection W_O to give the Multi-Head Attention output.]
Figure 1: h attention heads run in parallel, each with its own learned projection. Their outputs are concatenated and projected to the original dimension.
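
The four steps above can be sketched in NumPy. This is a minimal illustration, not a reference implementation: the function and variable names are made up for this post, the weights are random rather than trained, and it assumes self-attention (Q, K, and V all come from the same input X). It also uses the common trick of applying one full-width projection and then reshaping into heads, which is equivalent to giving each head its own smaller projection matrix.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, h):
    """Self-attention over X of shape (n_tokens, d_model) using h heads."""
    n, d_model = X.shape
    d_head = d_model // h

    def split(M):
        # (n, d_model) -> (h, n, d_head): one slice of features per head.
        return M.reshape(n, h, d_head).transpose(1, 0, 2)

    # Step 1: project, then split into h pieces.
    Q, K, V = split(X @ W_q), split(X @ W_k), split(X @ W_v)

    # Step 2: scaled dot-product attention, independently per head.
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)  # (h, n, n)
    weights = softmax(scores, axis=-1)
    heads = weights @ V                                   # (h, n, d_head)

    # Step 3: concatenate the heads back into (n, d_model).
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)

    # Step 4: final projection with W_O.
    return concat @ W_o, weights

# Toy sizes and random weights, chosen only for illustration.
rng = np.random.default_rng(0)
n, d_model, h = 5, 16, 4
X = rng.normal(size=(n, d_model))
W_q, W_k, W_v, W_o = (
    rng.normal(size=(d_model, d_model)) / np.sqrt(d_model) for _ in range(4)
)

out, weights = multi_head_attention(X, W_q, W_k, W_v, W_o, h)
print(out.shape)      # (5, 16)
print(weights.shape)  # (4, 5, 5)
```

The output has the same shape as the input, and `weights` holds one n × n attention map per head, which is what attention visualisations plot.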

In Numbers

The original “Attention Is All You Need” paper uses:

  • Model dimension: d_model = 512
  • Number of heads: h = 8
  • Head dimension: d_k = d_v = 512 / 8 = 64

So each head works in a 64-dimensional subspace. Each head is much cheaper than a full 512-dimensional head, and together the eight heads can produce eight different attention patterns where a single head produces only one.
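
To make these numbers concrete, here is a quick back-of-the-envelope parameter count. It assumes the projections have no bias terms, and the variable names are just for illustration.

```python
d_model = 512
h = 8
d_k = d_model // h                    # 64 dimensions per head

per_head_proj = d_model * d_k         # one per-head matrix, 512 x 64
qkv_params = 3 * h * per_head_proj    # Q, K and V projections for all heads
out_params = (h * d_k) * d_model      # W_O maps the concatenation back to d_model
total = qkv_params + out_params

print(d_k, per_head_proj, qkv_params, out_params, total)
# 64 32768 786432 262144 1048576
```

Since h × d_k = d_model, the eight small projections together hold exactly as many weights as one 512 × 512 matrix, so splitting into heads does not add parameters.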

The formula is simply:

MultiHead(Q, K, V) = Concat(head₁, …, headₕ) · W_O

where headᵢ = Attention(Q·Wᵢ_Q, K·Wᵢ_K, V·Wᵢ_V)
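
For reference, the Attention function inside each head is scaled dot-product attention. Written out in LaTeX, with the matrix shapes used in the original paper:

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V

W_i^{Q},\, W_i^{K} \in \mathbb{R}^{d_{\text{model}} \times d_k}, \qquad
W_i^{V} \in \mathbb{R}^{d_{\text{model}} \times d_v}, \qquad
W^{O} \in \mathbb{R}^{h d_v \times d_{\text{model}}}
```

Dividing by √d_k keeps the dot products from growing with the head dimension, which would otherwise push the softmax towards very peaked distributions.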

What Each Head Learns

Research on attention visualisation (e.g., BERTology papers) suggests that different heads often specialise:

  • Some heads track syntactic dependencies (subject–verb agreement).
  • Some heads track co-reference (resolving “it” → “animal”).
  • Some heads track positional proximity (attending mostly to adjacent tokens).
  • Some heads look broadly across the whole sequence.

This specialisation emerges from training; nobody explicitly assigns these roles.
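
One simple way to tell a "local" head from a "broad" one is to measure how far, on average, each token looks. The sketch below is a hypothetical diagnostic written for this post, not a method taken from a specific paper. It uses two hand-built attention maps instead of weights from a trained model.

```python
import numpy as np

def mean_attention_distance(weights):
    """weights: (h, n, n) attention maps whose rows sum to 1.

    Returns an array of shape (h,): the attention-weighted average
    distance |i - j| between a query position i and the keys j it attends to.
    """
    _, n, _ = weights.shape
    idx = np.arange(n)
    dist = np.abs(idx[:, None] - idx[None, :])        # (n, n) token distances
    return (weights * dist).sum(axis=-1).mean(axis=-1)

n = 6

# Hand-built "local" head: every token attends to the previous token
# (the first token attends to itself).
local = np.zeros((n, n))
local[0, 0] = 1.0
for i in range(1, n):
    local[i, i - 1] = 1.0

# Hand-built "broad" head: every token attends uniformly to all tokens.
broad = np.full((n, n), 1.0 / n)

d = mean_attention_distance(np.stack([local, broad]))
print(d.round(3))  # [0.833 1.944]
```

A low score points to a head that mostly watches its neighbours, and a high score to one that spreads attention across the sequence. On a real model you would pass in the per-head attention weights it actually produces.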

Efficiency Note

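The total compute is roughly the same as for one big attention head of dimension d. For a sequence of n tokens, the Q, K, V, and W_O projections cost on the order of n·d² operations and the attention scores on the order of n²·d, and splitting d across h heads leaves both totals unchanged. GPUs parallelise this well because each head is independent. Within the attention layer, the final W_O projection is the only place where the heads interact.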
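
A rough operation count makes the comparison concrete. This sketch counts only the matrix multiplications (two operations per multiply-add) and ignores the softmax, whose cost does grow with the number of heads but is small by comparison. The function name and the sequence length are assumptions for illustration.

```python
def attention_matmul_flops(n, d_model, h):
    """Approximate matmul cost of one attention layer over n tokens."""
    d_head = d_model // h
    projections = 4 * 2 * n * d_model * d_model   # Q, K, V and W_O projections
    scores = h * 2 * n * n * d_head               # Q @ K^T in every head
    mixing = h * 2 * n * n * d_head               # attention weights @ V in every head
    return projections + scores + mixing

single = attention_matmul_flops(n=128, d_model=512, h=1)
multi = attention_matmul_flops(n=128, d_model=512, h=8)
print(single == multi)  # True
```

The head count cancels out because h × d_head always equals d_model, so eight heads of width 64 do the same amount of matrix arithmetic as one head of width 512.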

✅ Key Takeaways

  • Multi-Head Attention runs h independent attention operations in lower-dimensional subspaces.
  • Each head learns a different set of Q, K, V projections — and tends to specialise in different relationship types.
  • Outputs are concatenated and projected back to d_model with a final linear layer W_O.
  • Total compute ≈ single-head attention at the same d_model; the gain is several distinct attention patterns per token instead of one.