Positional Encodings: Why Position Matters
Published:
The Order-Agnostic Problem
Self-attention computes pairwise scores between all tokens. It doesn’t matter if token A is first or last — the attention equation treats both identically. Shuffle the sentence and the model gets the exact same output (just with rows permuted).
This is catastrophic for language: “dog bites man” and “man bites dog” have opposite meanings.
The Solution: Inject Position into the Embedding
The fix is conceptually simple: before the first attention layer, add a position-dependent vector to each token’s embedding.
final_input[pos] = token_embedding[pos] + positional_encoding[pos]
The attention mechanism then sees the mixed vector and can pick up position information from it. Simple. But the choice of what those position vectors are turns out to matter a lot.
The Landscape of PE Methods
| Method | Type | Learnable? | Extrapolates? | Used in |
|---|---|---|---|---|
| Sinusoidal | Absolute | No | Moderate | Original Transformer (2017) |
| Learned Absolute | Absolute | Yes | No | BERT, GPT-1, ViT |
| Relative (Shaw) | Relative | Yes | Yes | Music Transformer |
| Relative (T5 Bias) | Relative | Yes | Yes | T5, Flan-T5 |
| RoPE | Rotary (Absolute→Relative) | No | Good | LLaMA, Mistral, GPT-NeoX |
| ALiBi | Attention bias | No | Excellent | BLOOM, MPT |
Three Axes to Understand PEs
1. Absolute vs. Relative Absolute methods assign a vector to each position index (0, 1, 2, …). Relative methods instead encode the distance between two tokens (±1, ±2, …). Relative encodings tend to generalise better across lengths.
2. Fixed vs. Learned Fixed methods (sinusoidal, ALiBi) use a deterministic formula — no extra parameters. Learned methods (BERT-style, relative biases) train position representations end-to-end. Learned = more flexible; fixed = no max-length constraint.
3. Extrapolation Can the model handle sequences longer than those seen during training? This is the key practical question for LLMs serving long documents. ALiBi and RoPE generally win here; standard learned absolute PEs fail badly.
✅ Key Takeaways
- Self-attention is order-agnostic; PEs inject position information as vectors added to token embeddings.
- The main design axes are: absolute vs. relative, fixed vs. learned, extrapolation capability.
- Modern LLMs (LLaMA, Mistral, BLOOM) moved away from sinusoidal PEs toward RoPE and ALiBi.
- Each subsequent chapter covers one PE method in depth — start with sinusoidal to understand the origin.
