Relative Positional Encodings: It’s All About Distance
The Problem with Absolute Position
Absolute PEs assign a position vector based on where a token sits in the sequence — position 0, position 1, etc.
But think about what really matters for attention: whether token A is close to or far from token B, not exactly where either one sits in a global index. The word “love” in “I love dogs” and “I really truly love dogs” has the same relationship to “dogs” (it’s the verb, directly preceding the object) — even though its absolute position is different.
Relative PE captures exactly this intuition.
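The example above can be checked with a few lines of plain Python. This is only an illustrative sketch: the helper name `offset` and the whitespace tokenisation are assumptions made for the demo, not part of any positional-encoding scheme.

```python
def offset(tokens, a, b):
    """Signed distance from token a to token b (positive if b comes later)."""
    return tokens.index(b) - tokens.index(a)

short_sent = "I love dogs".split()
long_sent = "I really truly love dogs".split()

# The absolute index of "love" changes between the two sentences...
print(short_sent.index("love"), long_sent.index("love"))  # 1 3

# ...but its offset to "dogs" is the same in both.
print(offset(short_sent, "love", "dogs"), offset(long_sent, "love", "dogs"))  # 1 1
```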
Shaw et al. (2018): Relative Attention
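Shaw, Uszkoreit, and Vaswani modify the attention score between token i and token j to include a learned relative position embedding a_{ij}, which is added to the key before the dot product with the query:

$$e_{ij} = \frac{(x_i W^Q)\,(x_j W^K + a_{ij})^\top}{\sqrt{d_z}}$$

where x_i is the input at position i, W^Q and W^K are the query and key projections, and d_z is the per-head dimension.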
Here a_{ij} is the embedding for the clipped relative distance clip(j − i, −k, k). Offsets beyond a maximum distance k (e.g., 16) are clipped, so all pairs farther apart than k in a given direction share the same embedding and only 2k + 1 embeddings need to be learned.
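A minimal NumPy sketch of this score computation, for a single attention head, might look like the following. The function name `shaw_scores`, the toy shapes, and the random matrices standing in for learned weights are all assumptions for illustration; this is not the authors' code.

```python
import numpy as np

def shaw_scores(x, w_q, w_k, rel_k, k):
    """Unnormalised attention scores with clipped relative key embeddings.

    x: (n, d_model) inputs; w_q, w_k: (d_model, d_z) projections;
    rel_k: (2k + 1, d_z) table with one embedding per clipped offset.
    """
    n = x.shape[0]
    q = x @ w_q                                   # (n, d_z) queries
    keys = x @ w_k                                # (n, d_z) keys
    pos = np.arange(n)
    # Entry [i, j] holds clip(j - i, -k, k), shifted by k into 0..2k.
    rel = np.clip(pos[None, :] - pos[:, None], -k, k) + k
    a = rel_k[rel]                                # (n, n, d_z), a[i, j] = a_ij
    content = q @ keys.T                          # q_i . k_j
    position = np.einsum("id,ijd->ij", q, a)      # q_i . a_ij
    return (content + position) / np.sqrt(q.shape[-1])

# Toy sizes, chosen only for the demo.
rng = np.random.default_rng(0)
n, d_model, d_z, k = 6, 8, 4, 2
x = rng.normal(size=(n, d_model))
w_q = rng.normal(size=(d_model, d_z))
w_k = rng.normal(size=(d_model, d_z))
rel_k = rng.normal(size=(2 * k + 1, d_z))         # stands in for learned a_ij

scores = shaw_scores(x, w_q, w_k, rel_k, k)
print(scores.shape)  # (6, 6)
```

Materialising the full (n, n, d_z) tensor is the simplest way to write this but not the cheapest; it is used here only to keep the correspondence with the formula obvious.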
T5 Relative Bias
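Raffel et al. (T5, 2020) simplify further. Instead of a full vector per relative position, they add a learned scalar bias to the attention score:

$$e_{ij} = q_i^\top k_j + b(\mathrm{bucket}(j - i))$$

where q_i and k_j are the query and key vectors for tokens i and j.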
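b(·) is a small lookup table of scalars, indexed by bucketed relative distance. Small offsets (such as −1, 0, +1) each get their own bucket; larger offsets share buckets whose width grows logarithmically, and everything beyond a maximum offset (128 in the paper, with 32 buckets in total) falls into a single bucket. The biases are shared across all layers but learned separately per attention head.
The table is tiny: one scalar per bucket per head, regardless of sequence length. Because every offset, however large, maps to some bucket, the same table applies unchanged to sequences longer than those seen in training.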
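The bucketing can be sketched as below. This is a simplified scalar version modelled on the scheme described for T5 (bidirectional attention, 32 buckets, maximum distance 128); the names `relative_bucket`, `relative_bias`, and `bias_table` are illustrative, not any library's API, and the random table stands in for learned biases.

```python
import math
import numpy as np

def relative_bucket(rel, num_buckets=32, max_distance=128):
    """Map a signed offset rel = j - i to a bucket id in [0, num_buckets)."""
    half = num_buckets // 2            # half of the buckets per direction
    base = half if rel > 0 else 0      # keys to the right of the query use the upper half
    dist = abs(rel)
    max_exact = half // 2              # distances below this get their own bucket
    if dist < max_exact:
        return base + dist
    # Larger distances fall into logarithmically widening buckets.
    log_ratio = math.log(dist / max_exact) / math.log(max_distance / max_exact)
    large = max_exact + int(log_ratio * (half - max_exact))
    return base + min(large, half - 1)  # everything far away shares the last bucket

num_heads, num_buckets = 4, 32
rng = np.random.default_rng(0)
bias_table = rng.normal(size=(num_buckets, num_heads))  # stands in for learned scalars

def relative_bias(seq_len):
    """Return the (heads, seq_len, seq_len) bias to add to the attention scores."""
    idx = np.array([[relative_bucket(j - i) for j in range(seq_len)]
                    for i in range(seq_len)])
    return bias_table[idx].transpose(2, 0, 1)

bias = relative_bias(6)
print(bias.shape)                                       # (4, 6, 6)
print(relative_bucket(-1), relative_bucket(1))          # 1 17
print(relative_bucket(-128), relative_bucket(-500))     # 15 15
```

In this sketch the whole table holds 32 × 4 = 128 scalars, and `relative_bias(6)` and `relative_bias(600)` both read from that same table.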
Why Relative PE Generalises Better
Absolute PE puts a token at “position 42” — if training sequences were at most 64 long, the model learned what position 42 means. At position 200? It never saw that index.
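Relative PE never mentions global positions. It only says “these two tokens are 5 apart.” As long as the model has seen pairs 5 apart before (which it almost certainly has), that lookup is familiar at any sequence length. Offsets larger than any seen in training are absorbed by clipping (Shaw et al.) or by the shared far-distance bucket (T5), so no lookup index is ever out of range. This does not guarantee quality at arbitrary lengths, but it removes the "unseen index" failure that absolute PE has.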
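The difference can be illustrated by counting which lookup indices appear at test time but never during training. The sketch below assumes training on length 64, testing on length 200, and Shaw-style clipping with k = 16; the helper names are made up for the demo.

```python
train_len, test_len, k = 64, 200, 16

def absolute_indices(n):
    """Position indices an absolute PE table is asked for at length n."""
    return set(range(n))

def clipped_offsets(n, k):
    """Clipped relative offsets a relative PE table is asked for at length n."""
    return {max(-k, min(k, j - i)) for i in range(n) for j in range(n)}

unseen_abs = absolute_indices(test_len) - absolute_indices(train_len)
unseen_rel = clipped_offsets(test_len, k) - clipped_offsets(train_len, k)

print(len(unseen_abs))  # 136 positions (64..199) never seen in training
print(len(unseen_rel))  # 0 offsets never seen in training
```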
✅ Key Takeaways
- Relative PE encodes the gap between pairs of tokens, not their absolute position.
- Shaw et al. (2018) add a learned relative-position vector to each key inside the QK dot product; T5 adds a simpler scalar bias to the score.
- Generalises better to longer sequences — the model never sees an "unseen absolute position".
- T5-style relative biases are lightweight (one small table) and used across Flan-T5, Switch Transformer, and more.
