Scaled Dot-Product Attention: Why the √d Matters

3 minute read

Published:

TL;DR: Without the √d_k scaling factor, the spread of the dot products grows with dimension (std ≈ √d_k) → softmax outputs sit near 0 or 1 → gradients vanish and training stalls. Dividing by √d_k keeps the scores at roughly unit variance regardless of d_k.

The Formula

Scaled dot-product attention is the engine inside every Transformer. Given queries Q, keys K, and values V:

Attention(Q, K, V) = softmax( Q Kᵀ / √d_k ) · V

The term d_k is the dimension of the key vectors. The division by √d_k is the “scaling” in the name. It looks minor. It is not.
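To make the formula concrete, here is a minimal sketch in plain Python, using only the standard library. The function names `softmax` and `attention` and the toy inputs are illustrative assumptions, not taken from any particular framework; real implementations operate on batched tensors.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)                              # subtract the max so exp() cannot overflow
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention on lists of row vectors.

    Q: n_q rows of length d_k, K: n_k rows of length d_k,
    V: n_k rows of length d_v. Returns n_q rows of length d_v.
    """
    d_k = len(K[0])
    scale = math.sqrt(d_k)
    out = []
    for q in Q:
        # One score per key: (q . k) / sqrt(d_k)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in K]
        weights = softmax(scores)            # attention weights, sum to 1
        # Output row is the weighted average of the value rows
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

# Toy input, made up for illustration: 1 query, 2 keys, 2 values
Q = [[1.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]
result = attention(Q, K, V)
print(result)
```

The query matches the first key more closely, so the output leans toward the first value row while still mixing in some of the second.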

Why Dot Products Grow in High Dimensions

Imagine two vectors q and k, each of dimension d_k, with components independently drawn from a standard normal distribution (mean 0, variance 1).

Their dot product q · k = Σᵢ qᵢkᵢ has:

  • Mean = 0 (products of zero-mean variables)
  • Variance = d_k (sum of d_k independent unit-variance terms)

So the standard deviation of the dot product grows as √d_k.
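The step from unit-variance components to a variance of d_k can be written out as follows. This is a short derivation under the stated assumptions only (all components independent, mean 0, variance 1); trained queries and keys need not satisfy them exactly.

```latex
\begin{align*}
\mathbb{E}[q_i k_i] &= \mathbb{E}[q_i]\,\mathbb{E}[k_i] = 0, \\
\operatorname{Var}(q_i k_i) &= \mathbb{E}[q_i^2]\,\mathbb{E}[k_i^2] - \left(\mathbb{E}[q_i k_i]\right)^2 = 1 \cdot 1 - 0 = 1, \\
\operatorname{Var}(q \cdot k) &= \sum_{i=1}^{d_k} \operatorname{Var}(q_i k_i) = d_k, \\
\operatorname{std}(q \cdot k) &= \sqrt{d_k}.
\end{align*}
```

The third line uses the fact that the d_k products are independent of one another, so their variances add.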

When d_k = 64 (a typical per-head value), individual dot products have std ≈ 8. When d_k = 512, std ≈ 22.6. As dimensions scale up, raw dot products naturally take on large absolute values.
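These numbers can be checked with a small Monte Carlo experiment. The sketch below assumes standard-normal components, as in the setup above, and uses only Python's standard library; `dot_std` is a hypothetical helper name, and the sample count is an arbitrary choice that trades accuracy for speed.

```python
import math
import random

def dot_std(d_k, n_samples=2000, seed=0):
    """Empirical standard deviation of q . k for standard-normal q and k."""
    rng = random.Random(seed)
    dots = []
    for _ in range(n_samples):
        # Draw a fresh pair (q, k) component by component and accumulate q_i * k_i
        dots.append(sum(rng.gauss(0.0, 1.0) * rng.gauss(0.0, 1.0)
                        for _ in range(d_k)))
    mean = sum(dots) / n_samples
    var = sum((x - mean) ** 2 for x in dots) / n_samples
    return math.sqrt(var)

for d_k in (4, 64, 512):
    # Compare the sampled std with the predicted sqrt(d_k)
    print(d_k, round(dot_std(d_k), 2), round(math.sqrt(d_k), 2))
```

The empirical values should land within a few percent of √d_k; the residual gap is sampling noise and shrinks as `n_samples` grows.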

What Happens to Softmax with Large Inputs

Softmax is defined as:

softmax(xᵢ) = exp(xᵢ) / Σⱼ exp(xⱼ)

When the inputs are far apart, say the vector [35, 2, -10, 1], the exponential turns those gaps into enormous ratios (softmax depends only on the differences between inputs, not on their absolute level). The largest value dominates completely. The output becomes something like [≈1.0, ≈0.0, ≈0.0, ≈0.0].

This is called softmax saturation. The “soft” maximum collapses into a hard argmax.
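The collapse is easy to reproduce. The sketch below evaluates softmax on the example vector and on the same vector divided by 10; the divisor is an arbitrary choice made only to show a less extreme input, and the `softmax` helper is a plain standard-library implementation written for this illustration.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)                              # shift so the largest exponent is 0
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

large = softmax([35, 2, -10, 1])             # widely separated inputs
small = softmax([3.5, 0.2, -1.0, 0.1])       # the same vector divided by 10

print([f"{p:.3g}" for p in large])
print([f"{p:.3g}" for p in small])
```

With the original vector, every entry except the first is vanishingly small. With the shrunken vector, the first entry still wins, but the others keep a visible share of the probability mass.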

The Gradient Problem

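Softmax saturation is catastrophic for learning because it kills the gradient. The derivative of each softmax output with respect to its own input is:

∂softmax(xᵢ)/∂xᵢ = softmax(xᵢ) · (1 − softmax(xᵢ))

When softmax(xᵢ) ≈ 1, this gradient ≈ 1·(1−1) = 0.
When softmax(xᵢ) ≈ 0, this gradient ≈ 0·(1−0) = 0.

The cross terms, ∂softmax(xᵢ)/∂xⱼ = −softmax(xᵢ) · softmax(xⱼ) for i ≠ j, vanish too, because in a saturated output at most one of the two factors is near 1. In both cases almost no gradient flows back through the softmax, so the attention weights barely update.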
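The vanishing gradient can be seen by computing the full softmax Jacobian, whose entries are J[i][j] = sᵢ·(δᵢⱼ − sⱼ) with s = softmax(x). The sketch below is a plain-Python illustration; `softmax_jacobian` and `max_abs` are hypothetical helper names, and the "moderate" input vector is made up for comparison.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_jacobian(xs):
    """J[i][j] = d softmax_i / d x_j = s_i * (delta_ij - s_j)."""
    s = softmax(xs)
    n = len(s)
    return [[s[i] * ((1.0 if i == j else 0.0) - s[j]) for j in range(n)]
            for i in range(n)]

def max_abs(J):
    """Largest absolute entry of a matrix given as a list of rows."""
    return max(abs(v) for row in J for v in row)

saturated = softmax_jacobian([35, 2, -10, 1])        # saturated softmax
moderate = softmax_jacobian([1.0, 0.5, -0.5, 0.0])   # inputs of order 1

print("largest |entry|, saturated:", max_abs(saturated))
print("largest |entry|, moderate: ", max_abs(moderate))
```

Every entry of the saturated Jacobian is many orders of magnitude smaller than the largest entry of the moderate one, which is what "no gradient flows" means in practice.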

The Fix: Divide by √d_k

Dividing each dot product by √d_k scales the variance back to 1:

Var( q·k / √d_k ) = Var(q·k) / d_k = d_k / d_k = 1

Now the inputs to softmax have a standard deviation of about 1 regardless of d_k. Softmax stays out of saturation, so gradients flow and learning works.

Key insight: The √d_k term is not a hyperparameter to tune. It is a mathematical consequence of how dot product variance scales with dimension. It keeps the model trainable as d_k grows.

Visualising the Effect

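| d_k            | Raw std(q·k) | Std into softmax | Softmax regime |
|----------------|--------------|------------------|----------------|
| 4              | 2            | 1                | Smooth         |
| 64             | 8            | 1                | Smooth         |
| 512            | 22.6         | 1                | Smooth         |
| 512 (unscaled) | 22.6         | 22.6             | Saturated      |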

Without scaling, widening the model pushes attention further into saturation.
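One way to see the two regimes numerically is to measure how much weight the single largest attention entry receives, with and without the √d_k division. The sketch below is a small simulation under the same standard-normal assumption; `mean_max_weight`, the number of keys, and the number of trials are all illustrative choices rather than anything prescribed by the architecture.

```python
import math
import random

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def mean_max_weight(d_k, scaled, n_keys=8, n_trials=100, seed=0):
    """Average of the largest attention weight over random queries and keys."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_trials):
        q = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        scores = []
        for _ in range(n_keys):
            # Dot product of q with a freshly drawn standard-normal key
            s = sum(qi * rng.gauss(0.0, 1.0) for qi in q)
            scores.append(s / math.sqrt(d_k) if scaled else s)
        total += max(softmax(scores))
    return total / n_trials

for d_k in (4, 64, 512):
    unscaled = mean_max_weight(d_k, scaled=False)
    scaled = mean_max_weight(d_k, scaled=True)
    print(d_k, round(unscaled, 3), round(scaled, 3))
```

A value near 1 means one key takes almost all the weight (saturation); a value near 1/n_keys means the weights are close to uniform. Without scaling the number climbs toward 1 as d_k grows, while with scaling it stays roughly constant.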

What About Other Scaling Choices?

Why √d_k specifically, and not d_k or some learned parameter?

  • ÷ d_k: over-shrinks; the variance drops to 1/d_k, so the scores become too small and softmax becomes too uniform (near-equal weights, no sharp attention)
  • ÷ √d_k: correct normalization that restores unit variance
  • Learned scale: works in practice (some models do this), but adds parameters and can be poorly initialised

Under the unit-variance assumption above, √d_k is the one fixed divisor that restores unit variance, and it does so without adding any parameters.
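The three regimes can be compared on a single set of scores. The sketch below draws stand-in raw scores with standard deviation √d_k (an assumption that mimics unscaled dot products rather than computing real ones), applies each divisor, and reports the entropy of the resulting softmax relative to its maximum. `normalized_entropy` is a hypothetical helper: 1.0 means perfectly uniform weights, 0.0 means one-hot.

```python
import math
import random

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def normalized_entropy(ps):
    """Entropy divided by log(n): 1.0 is uniform, 0.0 is one-hot."""
    h = -sum(p * math.log(p) for p in ps if p > 0.0)
    return h / math.log(len(ps))

d_k, n_keys = 512, 8
rng = random.Random(0)
# Stand-in for unscaled q . k scores: zero mean, std sqrt(d_k)
raw_scores = [rng.gauss(0.0, math.sqrt(d_k)) for _ in range(n_keys)]

divisors = {"none": 1.0, "sqrt(d_k)": math.sqrt(d_k), "d_k": float(d_k)}
results = {}
for name, divisor in divisors.items():
    weights = softmax([s / divisor for s in raw_scores])
    results[name] = normalized_entropy(weights)
    print(f"{name:>10}: normalized entropy = {results[name]:.4f}")
```

Dividing by a larger number always flattens the softmax, so the entropy rises from "none" to "sqrt(d_k)" to "d_k". The question is where on that scale the model starts: no division begins near one-hot, division by d_k begins almost exactly uniform, and √d_k sits in between.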

Summary

| Without scaling           | With scaling             |
|---------------------------|--------------------------|
| Dot products grow as √d_k | Dot products stay ~O(1)  |
| Softmax saturates         | Softmax is smooth        |
| Gradients vanish          | Gradients flow cleanly   |
| Larger models train worse | Larger models train fine |

Dividing by √d_k is a single operation that keeps Transformers trainable as they scale. It is easy to overlook, but foundational to why the architecture works at all.