Relative Positional Encodings: It’s All About Distance

TL;DR: Relative PE encodes how far apart two tokens are rather than where each one sits. Shaw et al. (2018) add learnable distance vectors to the attention computation; T5 adds a simpler learned scalar bias per distance bucket. Both tend to generalise better to longer sequences than absolute PE.

The Problem with Absolute Position

Absolute PEs assign a position vector based on where a token sits in the sequence — position 0, position 1, etc.

But think about what really matters for attention: whether token A is close to or far from token B, not exactly where either one sits in a global index. The word “love” in “I love dogs” and “I really truly love dogs” has the same relationship to “dogs” (it’s the verb, directly preceding the object) — even though its absolute position is different.

Relative PE captures exactly this intuition.
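
As a minimal sketch of that intuition, the snippet below uses plain whitespace tokenisation and a hypothetical helper `offset` (neither comes from any particular library). The absolute index of “love” changes between the two sentences, while its offset to “dogs” stays the same.

```python
def offset(tokens, src, dst):
    """Signed distance from the first `src` token to the first `dst` token."""
    return tokens.index(dst) - tokens.index(src)

shorter = "I love dogs".split()
longer = "I really truly love dogs".split()

# Absolute positions of "love" differ between the two sentences.
print(shorter.index("love"), longer.index("love"))

# The offset from "love" to "dogs" is identical in both.
print(offset(shorter, "love", "dogs"), offset(longer, "love", "dogs"))
```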

Shaw et al. (2018): Relative Attention

Shaw, Uszkoreit, and Vaswani modify the attention score between token i and token j to include a learned relative position embedding a_{ij}:

score(i, j) = (q_i · k_j + q_i · a_{ij}) / √d_k

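Here a_{ij} is the embedding for the clipped relative offset clip(j − i, −k, k), that is, the position of key j relative to query i (the sign convention Shaw et al. use). The maximum distance k (e.g., 16; not to be confused with the key vector k_j) caps the table at 2k + 1 embeddings: every offset above +k shares the embedding for +k, and every offset below −k shares the embedding for −k. Shaw et al. also add a second set of relative embeddings to the value vectors when computing the attention output.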
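
The score above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the authors' implementation: the helper names (`clipped_offset`, `relative_scores`) are hypothetical, the vectors are random stand-ins for learned parameters, and the offset is taken as j − i (key index minus query index), so the token right after the query sits at +1. Flipping the sign would only relabel the rows of the embedding table.

```python
import math
import random

def clipped_offset(i, j, k):
    """Offset from query i to key j, clipped to the range [-k, k]."""
    return max(-k, min(k, j - i))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def relative_scores(queries, keys, rel_emb, k):
    """score(i, j) = (q_i . k_j + q_i . a_ij) / sqrt(d_k).

    rel_emb holds 2k + 1 vectors: index 0 is offset -k, index 2k is offset +k.
    """
    d_k = len(queries[0])
    scores = []
    for i, q in enumerate(queries):
        row = []
        for j, key in enumerate(keys):
            a_ij = rel_emb[clipped_offset(i, j, k) + k]
            row.append((dot(q, key) + dot(q, a_ij)) / math.sqrt(d_k))
        scores.append(row)
    return scores

def rand_vec(dim):
    return [random.gauss(0, 1) for _ in range(dim)]

random.seed(0)
n, d_k, k = 6, 4, 2  # toy sizes, chosen only for illustration
queries = [rand_vec(d_k) for _ in range(n)]
keys = [rand_vec(d_k) for _ in range(n)]
rel_emb = [rand_vec(d_k) for _ in range(2 * k + 1)]  # stand-ins for learned a_ij

scores = relative_scores(queries, keys, rel_emb, k)
print(len(scores), len(scores[0]))  # one score per (query, key) pair
```

With k = 2, keys three or more positions to the right of a query all reuse the embedding for +2, which is the sharing behaviour described above.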

[Figure: Relative Distance Between Token Pairs. The sentence "I love dogs and cats" with "dogs" as the anchor; the other tokens sit at distances −2, −1, +1, +2. Key insight: the same relative distances apply at any sentence position, so "love" → "dogs" is always distance +1, whether at position 3, 7, or 25 globally. Panel "Shaw et al. (2018)": learned vector a_{ij} added to the QK attention score; also modifies the V computation. Panel "T5 Relative Bias (2020)": learned scalar bias b_{ij} added to the attention logits; simpler, uses buckets for distance, shared across layers.]
Figure 1: "dogs" attending to tokens at relative distances −2, −1, +1, +2. The same distance embeddings apply regardless of absolute position.

T5 Relative Bias

Raffel et al. (T5, 2020) simplify further. Instead of a full vector per relative position, they add a learned scalar bias to the attention score:

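score(i, j) = q_i · k_j / √d_k + b(j − i)

b(·) is a small lookup table of scalars, indexed by the bucket that the offset j − i (key position minus query position) falls into. Small offsets such as −1, 0, and +1 each get their own bucket; larger offsets share buckets whose ranges widen logarithmically, and all offsets beyond a maximum distance in a given direction share one bucket. The biases are shared across all layers but learned separately per attention head.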

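This is very memory-efficient: the whole table holds only (number of buckets) × (number of heads) scalars. It also extends naturally to longer sequences, because an offset larger than any seen in training still maps to an existing bucket.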
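
The bucketing can be sketched as below. This is a scalar, pure-Python approximation modelled on the bidirectional scheme described for T5, not the reference code: the function names are hypothetical, the defaults of 32 buckets and a maximum distance of 128 follow values reported for T5 but should be treated as assumptions here, and the bias table is filled with random numbers in place of learned parameters. Offsets are taken as j − i.

```python
import math
import random

def relative_position_bucket(offset, num_buckets=32, max_distance=128):
    """Map a signed offset (j - i) to a bucket index in [0, num_buckets)."""
    half = num_buckets // 2            # half the buckets for each direction
    base = half if offset > 0 else 0   # positive offsets use the upper half
    dist = abs(offset)
    max_exact = half // 2              # small distances get one bucket each
    if dist < max_exact:
        return base + dist
    # Larger distances fall into logarithmically widening buckets.
    log_ratio = math.log(dist / max_exact) / math.log(max_distance / max_exact)
    large = max_exact + int(log_ratio * (half - max_exact))
    # Everything at or beyond max_distance shares the last bucket.
    return base + min(large, half - 1)

random.seed(0)
num_heads, num_buckets = 2, 32
# One row of scalar biases per head; random stand-ins for learned values.
bias_table = [[random.gauss(0, 1) for _ in range(num_buckets)]
              for _ in range(num_heads)]

def bias(head, i, j):
    """Scalar added to the attention logit for query i and key j."""
    return bias_table[head][relative_position_bucket(j - i, num_buckets)]

for off in (-1, 0, 1, 7, 8, 127, 128, 1000):
    print(off, relative_position_bucket(off))
```

The entire positional parameterisation in this sketch is the `num_heads × num_buckets` table, and very distant pairs simply reuse the last bucket for their direction.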

Why Relative PE Generalises Better

Absolute PE puts a token at “position 42” — if training sequences were at most 64 long, the model learned what position 42 means. At position 200? It never saw that index.

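Relative PE never mentions global positions. It only says “these two tokens are 5 apart.” As long as the model has seen pairs 5 apart before (which it almost certainly has), that signal carries over to sequences of any length, and with clipping or bucketing even very large gaps map to an embedding seen during training.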
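
A small check makes the argument concrete. It reuses the numbers from the text (training length 64, test length 200) and assumes Shaw-style clipping with k = 16; the helper name `clipped_offsets` is hypothetical. The sketch counts how many absolute positions and how many clipped relative offsets appear at test time without having appeared during training.

```python
def clipped_offsets(seq_len, k):
    """All clipped offsets (j - i) that occur in a sequence of this length."""
    return {max(-k, min(k, j - i))
            for i in range(seq_len)
            for j in range(seq_len)}

train_len, test_len, k = 64, 200, 16

# Absolute positions: every index from 64 to 199 is new at test time.
unseen_absolute = set(range(test_len)) - set(range(train_len))

# Relative offsets: the longer sequence produces no new clipped offsets.
unseen_relative = clipped_offsets(test_len, k) - clipped_offsets(train_len, k)

print(len(unseen_absolute), len(unseen_relative))
```

This only shows that no unseen index reaches the model; it says nothing about how well attention patterns learned on short sequences behave on much longer ones.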

✅ Key Takeaways

  • Relative PE encodes the gap between pairs of tokens, not their absolute position.
  • Shaw et al. (2018) add a learned vector to the QK dot-product; T5 adds a simpler scalar bias.
  • Generalises better to longer sequences — the model never sees an "unseen absolute position".
  • T5-style relative biases are lightweight (one small table) and used across Flan-T5, Switch Transformer, and more.