Positional Encodings: Why Position Matters

2 minute read

Published:

TL;DR: Because self-attention is order-agnostic, Transformers need an extra signal to know which token is at which position. Positional encodings (PEs) inject this information as vectors added to the token embeddings. Different PE designs have wildly different properties.

The Order-Agnostic Problem

Self-attention computes pairwise scores between all tokens. It doesn’t matter if token A is first or last — the attention equation treats both identically. Shuffle the sentence and the model gets the exact same output (just with rows permuted).

This is catastrophic for language: “dog bites man” and “man bites dog” have opposite meanings.

The Solution: Inject Position into the Embedding

The fix is conceptually simple: before the first attention layer, add a position-dependent vector to each token’s embedding.

final_input[pos] = token_embedding[pos] + positional_encoding[pos]

The attention mechanism then sees the mixed vector and can pick up position information from it. Simple. But the choice of what those position vectors are turns out to matter a lot.

Word Embedding "cat" → [0.3, −0.7, 0.1, ...] dim: d_model + Positional Encoding pos=1 → [0.0, 1.0, 0.0, ...] same dim: d_model element-wise add Input to Transformer: [0.3, 0.3, 0.1, ...] carries both word identity + position Types of Positional Encodings: Sinusoidal Learned Relative RoPE ALiBi
Figure 1: Positional encoding is added element-wise to the token embedding before the first attention layer.

The Landscape of PE Methods

MethodTypeLearnable?Extrapolates?Used in
SinusoidalAbsoluteNoModerateOriginal Transformer (2017)
Learned AbsoluteAbsoluteYesNoBERT, GPT-1, ViT
Relative (Shaw)RelativeYesYesMusic Transformer
Relative (T5 Bias)RelativeYesYesT5, Flan-T5
RoPERotary (Absolute→Relative)NoGoodLLaMA, Mistral, GPT-NeoX
ALiBiAttention biasNoExcellentBLOOM, MPT

Three Axes to Understand PEs

1. Absolute vs. Relative Absolute methods assign a vector to each position index (0, 1, 2, …). Relative methods instead encode the distance between two tokens (±1, ±2, …). Relative encodings tend to generalise better across lengths.

2. Fixed vs. Learned Fixed methods (sinusoidal, ALiBi) use a deterministic formula — no extra parameters. Learned methods (BERT-style, relative biases) train position representations end-to-end. Learned = more flexible; fixed = no max-length constraint.

3. Extrapolation Can the model handle sequences longer than those seen during training? This is the key practical question for LLMs serving long documents. ALiBi and RoPE generally win here; standard learned absolute PEs fail badly.

✅ Key Takeaways

  • Self-attention is order-agnostic; PEs inject position information as vectors added to token embeddings.
  • The main design axes are: absolute vs. relative, fixed vs. learned, extrapolation capability.
  • Modern LLMs (LLaMA, Mistral, BLOOM) moved away from sinusoidal PEs toward RoPE and ALiBi.
  • Each subsequent chapter covers one PE method in depth — start with sinusoidal to understand the origin.