Self-Attention: Teaching Machines to Focus

3 minute read

Published:

TL;DR: Self-attention lets each token in a sequence ask "who else here is relevant to me?" It uses three learned projections — Query, Key, and Value — to compute a weighted mixture of all token representations in one shot.

The Focusing Analogy

Imagine reading a paper and highlighting sentences that are relevant to your current question. Self-attention does something similar: for every word in a sentence, it calculates a relevance score against every other word, then creates a new representation that is a blend of the most relevant words.

Consider: “The animal didn’t cross the street because it was too tired.”

What does “it” refer to? To understand this, the model needs to relate “it” to “animal” (not “street”). Self-attention learns to assign a high score to that pair.

Query, Key, and Value

Self-attention introduces three projections from each token’s embedding vector:

  • Query (Q): “What am I looking for?”
  • Key (K): “What do I contain?”
  • Value (V): “What will I contribute if selected?”

Think of it like a library search system:

  • You submit a Query (your search request).
  • Every book has a Key (its index entry).
  • The relevance between your query and each key determines how much of each book’s Value (its content) you receive.

Each of Q, K, V is produced by multiplying the token’s embedding by a learned weight matrix: Q = X·W_Q, K = X·W_K, V = X·W_V.

The Four-Step Computation

Token Embeddings (X) x₁ x₂ x₃ ① Linear projections W_Q W_K W_V Q, K, V matrices Q K V ② Scores = Q · Kᵀ / √dₖ Attention scores 0.8 0.1 0.1 0.3 0.6 0.1 0.2 0.2 0.6 ③ Softmax (row-wise) ④ Weights × V → Output Contextualised token vectors
Figure 1: The four steps of scaled dot-product attention.

Step 1 — Project: Multiply the input matrix X by three weight matrices to get Q, K, V.

Step 2 — Score: Compute the dot product of every query with every key: Q · Kᵀ. Divide by √dₖ to prevent large values from pushing the softmax into saturation.

Attention(Q, K, V) = softmax( Q · Kᵀ / √dk ) · V

Step 3 — Normalise: Apply softmax row-wise. Each row now sums to 1, giving a probability distribution: “how much should token i attend to token j?”

Step 4 — Mix: Multiply the attention weights by V. Each token’s output is a weighted average of all value vectors — heavily weighted towards the tokens it found most relevant.

Why Divide by √dₖ?

Without scaling, the dot products grow large as dimensionality dₖ increases (because they’re sums of dₖ products). Large values push softmax into regions where gradients are tiny, slowing training. Dividing by √dₖ keeps the variance stable regardless of model size.

What the Scores Represent

The attention matrix has a score for every (query-token, key-token) pair. High score = “I find you useful.” After softmax, these are weights that determine how much each token borrows from each other token when forming its output representation.

Critically, this computation is fully differentiable — the model learns which token pairs should have high attention purely from training signal, with no hand-crafted rules.

✅ Key Takeaways

  • Self-attention projects each token into Q, K, V vectors via learned weight matrices.
  • Relevance scores are dot products of Q and K, scaled by √d_k and normalised with softmax.
  • The output of each token is a weighted sum of all Value vectors.
  • The whole computation runs in parallel across all token pairs — no sequential dependency.