Sinusoidal Positional Encodings: The Original Solution

TL;DR: Sinusoidal PE assigns each position a unique vector made of alternating sin/cos values at geometrically spaced frequencies. It requires no training, generalises gracefully, and was the default for early Transformers.

The Formula

For a token at position pos in a model of width d, the PE vector's entries for dimension pair i (i = 0, 1, …, d/2 − 1) are:

PE(pos, 2i)   = sin( pos / 10000^(2i/d) )
PE(pos, 2i+1) = cos( pos / 10000^(2i/d) )

That’s it. Even dimensions get a sine, odd dimensions get a cosine. The frequency shrinks geometrically as i increases.
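
As a minimal sketch of the formula (not a reference implementation), the whole table can be built in a few lines of NumPy. The function name sinusoidal_pe, its parameter names, and the 10 × 16 example size are illustrative choices, and the sketch assumes an even model width d:

```python
import numpy as np

def sinusoidal_pe(num_positions: int, d_model: int, base: float = 10000.0) -> np.ndarray:
    """Return a (num_positions, d_model) array of sinusoidal encodings.

    Assumes d_model is even, since dimensions are filled in sin/cos pairs.
    """
    if d_model % 2 != 0:
        raise ValueError("d_model must be even")
    pos = np.arange(num_positions)[:, None]      # shape (num_positions, 1)
    i = np.arange(d_model // 2)[None, :]         # pair index, shape (1, d_model // 2)
    angles = pos / base ** (2 * i / d_model)     # pos / 10000^(2i/d)
    pe = np.zeros((num_positions, d_model))
    pe[:, 0::2] = np.sin(angles)                 # even dimensions get the sine
    pe[:, 1::2] = np.cos(angles)                 # odd dimensions get the cosine
    return pe

pe = sinusoidal_pe(num_positions=10, d_model=16)
print(pe.shape)     # (10, 16)
print(pe[0, :4])    # position 0: sin(0) = 0 and cos(0) = 1 alternate
```

In a Transformer this table is typically added to the token embeddings before the first layer, so it only needs to be computed once.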

The Intuition: A Continuous Binary Counter

Think of binary numbers: 0001, 0010, 0011, 0100, … The rightmost bit flips every step (high frequency); the leftmost bit flips rarely (low frequency). Together they uniquely identify each integer.

Sinusoidal PE does the same in a continuous, smooth way:

  • Low dimensions (i small → high frequency): the sin/cos oscillates rapidly, capturing fine-grained position differences.
  • High dimensions (i large → low frequency): the sin/cos changes slowly, encoding coarse position.

Each position gets a unique fingerprint — a mix of fast and slow oscillations — that the model can read.
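
To make the counter analogy concrete, the sketch below measures how much the sine component of each dimension pair changes from one position to the next. The sizes (16 dimensions, 64 positions) are arbitrary illustrative values. If the intuition holds, the average step-to-step change should shrink as the pair index i grows, much as higher-order bits flip less often:

```python
import numpy as np

d_model, num_positions = 16, 64      # illustrative sizes, not from the text above
pos = np.arange(num_positions)[:, None]
i = np.arange(d_model // 2)[None, :]
angles = pos / 10000.0 ** (2 * i / d_model)

# Sine component of each dimension pair (the even dimensions of the PE vector).
sines = np.sin(angles)

# Mean absolute change between consecutive positions, one value per pair index i.
step_change = np.abs(np.diff(sines, axis=0)).mean(axis=0)
for pair, change in enumerate(step_change):
    print(f"pair i={pair}: mean |change| per step = {change:.4f}")
```

The first pair moves by a large fraction of its range at every step, while the last pairs barely move at all over 64 positions.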

[Figure: Sinusoidal PE heatmap (10 positions × 16 dims). Rows: positions 0–9. Columns: dimensions 0–15 (even dimensions labelled). Colour scale: +1 (max sin/cos) through ~0 to −1 (min).]
Figure 1: Sinusoidal PE heatmap. Each row is a position; each column is a dimension. Left columns (high frequency) alternate rapidly; right columns (low frequency) stay nearly constant.

Three Key Properties

1. Uniqueness. The combination of many frequencies produces a unique vector for each position — like a fingerprint. Two positions will never have the same PE vector.

2. Smooth transitions. Adjacent positions have similar PE vectors. The model can learn that nearby positions are related without any explicit guidance.

3. Relative encoding via dot products. The dot product PE(pos₁) · PE(pos₂) depends only on the distance pos₁ − pos₂. This means the model can implicitly reason about relative distances from absolute positions — a crucial and non-obvious property.
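
The third property follows from the identity sin(a)·sin(b) + cos(a)·cos(b) = cos(a − b): each sin/cos pair contributes cos((pos₁ − pos₂) · ωᵢ) to the dot product, where ωᵢ = 1 / 10000^(2i/d). The sketch below checks this numerically, along with the smoothness claim. The helper name sinusoidal_pe and the sizes used (200 positions, 64 dimensions, offset 7) are illustrative assumptions:

```python
import numpy as np

def sinusoidal_pe(num_positions, d_model, base=10000.0):
    """Hypothetical helper: (num_positions, d_model) sinusoidal table, d_model even."""
    pos = np.arange(num_positions)[:, None]
    i = np.arange(d_model // 2)[None, :]
    angles = pos / base ** (2 * i / d_model)
    pe = np.zeros((num_positions, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

d_model, offset = 64, 7
pe = sinusoidal_pe(200, d_model)

# Dot product of every position p with position p + offset.
dots = np.array([pe[p] @ pe[p + offset] for p in range(150)])
spread = dots.max() - dots.min()
print(f"spread across starting positions: {spread:.2e}")   # floating-point noise only

# Closed form: sum over pairs of cos(offset * omega_i).
omega = 1.0 / 10000.0 ** (2 * np.arange(d_model // 2) / d_model)
expected = np.cos(offset * omega).sum()
print(np.allclose(dots, expected))

# Smoothness: a neighbour is more similar than a distant position.
near, far = pe[10] @ pe[11], pe[10] @ pe[60]
print(f"near = {near:.2f}, far = {far:.2f}")
```

Because the value depends only on the offset, the same comparison gives the same answer wherever the pair of tokens sits in the sequence.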

Why Use 10000?

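The base 10000 is chosen so that the wavelengths form a geometric progression from 2π (highest frequency, dim 0) up to nearly 10000·2π ≈ 62,800 (lowest frequency, last dims). The fastest components resolve neighbouring tokens while the slowest vary over thousands of positions, giving the model coverage over positions from 1 to roughly 10,000 tokens, which was sufficient for most early use cases.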
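
The wavelength of dimension pair i is 2π · 10000^(2i/d), which can be tabulated directly. The sketch below assumes d = 512 purely for illustration. Note that the last pair comes out slightly below the 10000·2π upper bound, because i stops at d/2 − 1:

```python
import numpy as np

d_model = 512                         # assumed model width, for illustration only
i = np.arange(d_model // 2)           # pair index 0 .. d/2 - 1
wavelengths = 2 * np.pi * 10000.0 ** (2 * i / d_model)

print(f"shortest wavelength: {wavelengths[0]:.3f}")    # 2π, about 6.283 positions
print(f"longest wavelength:  {wavelengths[-1]:.0f}")   # a little under 10000·2π
print(f"upper bound:         {10000 * 2 * np.pi:.0f}")

# A constant ratio between neighbouring wavelengths means a geometric progression.
ratio = wavelengths[1:] / wavelengths[:-1]
print(np.allclose(ratio, ratio[0]))
```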

Limitations

  • Fixed formula, so it can’t be fine-tuned for a specific task.
  • Extrapolation beyond the training length is imperfect, though better than learned absolute PEs.
  • Modern LLMs (with 128K+ context windows) need better solutions — enter RoPE and ALiBi.

✅ Key Takeaways

  • Sinusoidal PE uses sin/cos at geometrically decreasing frequencies to build unique position fingerprints.
  • No parameters — fully deterministic and requires no training.
  • Adjacent positions have similar encodings; the dot product encodes relative distance implicitly.
  • Works well for sequences up to ~10K tokens; modern LLMs prefer RoPE or ALiBi for longer contexts.