Sinusoidal Positional Encodings: The Original Solution
The Formula
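For a token at position pos in a model of width d, the PE vector is filled in sine/cosine pairs indexed by i = 0, 1, …, d/2 − 1:
PE(pos, 2i) = sin( pos / 10000^{2i/d} )
PE(pos, 2i+1) = cos( pos / 10000^{2i/d} )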
That’s it. Even dimensions get a sine, odd dimensions get a cosine. The frequency shrinks geometrically as i increases.
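As a minimal sketch of the formula, assuming NumPy and an even model width, the whole table can be built with two broadcasted arrays. The function name `sinusoidal_pe` and its parameter names are illustrative choices, not taken from any particular library.

```python
import numpy as np

def sinusoidal_pe(num_positions: int, d_model: int, base: float = 10000.0) -> np.ndarray:
    """Return a (num_positions, d_model) array of sinusoidal position encodings."""
    if d_model % 2 != 0:
        raise ValueError("this sketch assumes an even d_model")
    pos = np.arange(num_positions, dtype=np.float64)[:, None]  # shape (P, 1)
    i = np.arange(d_model // 2, dtype=np.float64)[None, :]      # shape (1, d/2)
    angles = pos / base ** (2.0 * i / d_model)                   # shape (P, d/2)
    pe = np.empty((num_positions, d_model), dtype=np.float64)
    pe[:, 0::2] = np.sin(angles)  # even indices 2i get the sine
    pe[:, 1::2] = np.cos(angles)  # odd indices 2i+1 get the cosine
    return pe

pe = sinusoidal_pe(num_positions=50, d_model=16)
print(pe.shape)  # (50, 16)
```

Position 0 comes out as alternating 0s and 1s (sin 0 and cos 0), and each later row rotates every sine/cosine pair by its own fixed angle.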
The Intuition: A Continuous Binary Counter
Think of binary numbers: 0001, 0010, 0011, 0100, … The rightmost bit flips every step (high frequency); the leftmost bit flips rarely (low frequency). Together they uniquely identify each integer.
Sinusoidal PE does the same in a continuous, smooth way:
- Low-index dimensions (i small → high frequency): the sin/cos oscillates rapidly, capturing fine-grained position differences, like the rightmost bit.
- High-index dimensions (i large → low frequency): the sin/cos changes slowly, encoding coarse position, like the leftmost bit.
Each position gets a unique fingerprint — a mix of fast and slow oscillations — that the model can read.
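The counter analogy can be made concrete with a small sketch, assuming NumPy. It first prints the bits of the integers 0 to 7, then counts how often each sine component changes sign over 1000 positions. The sizes (`d_model = 8`, 1000 positions) are arbitrary illustrative choices.

```python
import numpy as np

# Binary counter: column k holds bit k of each position (k = 0 is the rightmost bit).
positions = np.arange(8)
bits = (positions[:, None] >> np.arange(3)[None, :]) & 1
print(bits.T)  # one row per bit: bit 0 alternates every step, bit 2 changes only once

# Sinusoidal analogue: count how often each sine component changes sign.
d_model, base = 8, 10000.0           # small, assumed sizes for illustration
pos = np.arange(1000)[:, None]
i = np.arange(d_model // 2)[None, :]
sines = np.sin(pos / base ** (2 * i / d_model))
nonneg = sines >= 0
flips = (nonneg[1:] != nonneg[:-1]).sum(axis=0)
print(flips)  # one count per i; small i flips far more often than large i
```

The sign-change counts play the role of bit flips: the fastest component flips hundreds of times over this range, while the slowest barely moves.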
Three Key Properties
1. Uniqueness. The combination of many frequencies produces a unique vector for each position — like a fingerprint. Two positions will never have the same PE vector.
2. Smooth transitions. Adjacent positions have similar PE vectors. The model can learn that nearby positions are related without any explicit guidance.
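3. Relative encoding via dot products. Because sin a · sin b + cos a · cos b = cos(a − b), each sine/cosine pair contributes cos(ω_i (pos₁ − pos₂)) to the dot product, where ω_i = 1 / 10000^{2i/d}. So PE(pos₁) · PE(pos₂) depends only on the distance pos₁ − pos₂. This means the model can implicitly reason about relative distances from absolute positions, a crucial and non-obvious property.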
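The dot-product property is easy to check numerically. The sketch below, assuming NumPy, rebuilds the encoding table with an illustrative helper (`sinusoidal_pe` is a hypothetical name) and compares all pairs of positions that are the same distance apart. The sizes 200 and 64 and the offset 7 are arbitrary.

```python
import numpy as np

def sinusoidal_pe(num_positions, d_model, base=10000.0):
    # Illustrative helper; assumes an even d_model.
    pos = np.arange(num_positions, dtype=np.float64)[:, None]
    i = np.arange(d_model // 2, dtype=np.float64)[None, :]
    angles = pos / base ** (2.0 * i / d_model)
    pe = np.empty((num_positions, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

d_model = 64
pe = sinusoidal_pe(200, d_model)
gram = pe @ pe.T                       # gram[a, b] = PE(a) · PE(b)

k = 7
diag_k = np.diagonal(gram, offset=k)   # all pairs (pos, pos + k)
print(np.allclose(diag_k, diag_k[0]))  # True: the value depends only on the offset k

# Closed form: the sum over pairs of cos(k * w_i)
w = 1.0 / 10000.0 ** (2.0 * np.arange(d_model // 2) / d_model)
print(np.allclose(diag_k[0], np.cos(k * w).sum()))  # True
```

One caveat worth keeping in mind: the dot product is a sum of cosines of the offset, so it is not strictly decreasing with distance, even though nearby positions score higher than far-apart ones in this setup.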
Why Use 10000?
The base 10000 sets the range of wavelengths: they grow geometrically from 2π (highest frequency, i = 0) to just under 10000·2π (lowest frequency, last dimension pair). This gives the model coverage over positions from 1 to roughly 10,000 tokens, which was sufficient for most early use cases.
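To see the wavelength range directly, here is a short sketch assuming NumPy and a model width of 512 (an assumed example size; any even width works). The wavelength of pair i is 2π · 10000^{2i/d}.

```python
import numpy as np

d_model, base = 512, 10000.0  # d_model = 512 is an assumed example size
i = np.arange(d_model // 2)
wavelengths = 2 * np.pi * base ** (2 * i / d_model)

print(wavelengths[0])   # 2π, the shortest wavelength
print(wavelengths[-1])  # a little under 10000·2π, the longest wavelength

# Consecutive wavelengths differ by a constant factor: a geometric progression.
ratio = wavelengths[1:] / wavelengths[:-1]
print(np.allclose(ratio, base ** (2 / d_model)))  # True
```

The last pair never quite reaches 10000·2π because the largest exponent is (d − 2)/d rather than 1.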
Limitations
- Fixed formula, so it can’t be fine-tuned for a specific task.
- Extrapolation beyond the training length is imperfect, though better than learned absolute PEs.
- Modern LLMs (with 128K+ context windows) need better solutions — enter RoPE and ALiBi.
✅ Key Takeaways
- Sinusoidal PE uses sin/cos at geometrically decreasing frequencies to build unique position fingerprints.
- No parameters — fully deterministic and requires no training.
- Adjacent positions have similar encodings; the dot product encodes relative distance implicitly.
- Works well for sequences up to ~10K tokens; modern LLMs prefer RoPE or ALiBi for longer contexts.
