Relative Positional Encodings: It’s All About Distance

Published: January 07, 2024

TL;DR: Relative PE encodes how far apart two tokens are rather than where each one sits. Shaw et al. (2018) add learnable distance vectors to the attention computation; T5 adds a simpler learned scalar bias per distance bucket. Both approaches tend to carry over to sequences longer than those seen in training more reliably than learned absolute PE.

The Problem with Absolute Position

Absolute PEs assign a position vector based on where a token sits in the sequence — position 0, position 1, etc.

But think about what really matters for attention: whether token A is close to or far from token B, not exactly where either one sits in a global index. The word “love” in “I love dogs” and “I really truly love dogs” has the same relationship to “dogs” (it’s the verb, directly preceding the object) — even though its absolute position is different.

Relative PE captures exactly this intuition.
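To make the example concrete, here is a tiny sketch (the helper name `offset` is made up for illustration) that computes both kinds of position for "love" in the two sentences. The absolute index changes, while the signed offset to "dogs" stays the same.

```python
def offset(tokens, a, b):
    """Signed distance from token `a` to token `b`: index(a) - index(b)."""
    return tokens.index(a) - tokens.index(b)

short = "I love dogs".split()
long = "I really truly love dogs".split()

# Absolute positions of "love" differ between the two sentences.
print(short.index("love"), long.index("love"))  # 1 3

# The offset between "love" and "dogs" is identical in both.
print(offset(short, "love", "dogs"), offset(long, "love", "dogs"))  # -1 -1
```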

Shaw et al. (2018): Relative Attention

Shaw, Uszkoreit, and Vaswani modify the attention score between token i and token j to include a learned relative position embedding a_{ij}:

score(i, j) = (q_i · k_j + q_i · a_{ij}) / √d_k

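Here a_{ij} is the embedding for the clipped relative distance clip(i − j, −k, k). The maximum distance k (e.g., 16; not to be confused with the key vector k_j) caps the table at 2k + 1 embeddings: every distance beyond ±k shares the embedding for ±k. The paper also adds a second set of relative embeddings to the value vectors; only the key-side term is shown here.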

Figure 1: "dogs" attending to tokens at relative distances −2, −1, +1, +2. The same distance embeddings apply regardless of absolute position.
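The score formula above can be sketched in a few lines of NumPy. This is a minimal single-head illustration, not the authors' implementation: the function names `relative_index` and `shaw_scores` are made up here, and the random `rel_table` stands in for embeddings that would normally be learned. It follows the article's i − j convention and covers only the key-side term.

```python
import numpy as np

def relative_index(n, k):
    """Clipped distances clip(i - j, -k, k), shifted to 0..2k for table lookup."""
    pos = np.arange(n)
    rel = pos[:, None] - pos[None, :]      # rel[i, j] = i - j
    return np.clip(rel, -k, k) + k         # integer indices in [0, 2k]

def shaw_scores(q, key, rel_table, k):
    """score(i, j) = (q_i . k_j + q_i . a_ij) / sqrt(d_k), for one head."""
    n, d_k = q.shape
    a = rel_table[relative_index(n, k)]        # (n, n, d_k): a[i, j] is the distance embedding
    content = q @ key.T                        # (n, n): q_i . k_j
    position = np.einsum("id,ijd->ij", q, a)   # (n, n): q_i . a_ij
    return (content + position) / np.sqrt(d_k)

rng = np.random.default_rng(0)
n, d_k, k = 6, 8, 2                            # toy sizes chosen for illustration
q = rng.normal(size=(n, d_k))
key = rng.normal(size=(n, d_k))
rel_table = rng.normal(size=(2 * k + 1, d_k))  # stand-in for learned embeddings
scores = shaw_scores(q, key, rel_table, k)
print(scores.shape)  # (6, 6)
```

Materialising the full (n, n, d_k) tensor keeps the sketch readable but is wasteful for long sequences; a practical implementation would avoid building it explicitly.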

T5 Relative Bias

Raffel et al. (T5, 2020) simplify further. Instead of a full vector per relative position, they add a learned scalar bias to the attention score:

score(i, j) = q_i · k_j / √d_k + b(i − j)

b(·) is a small lookup table of scalars, indexed by bucketed distances. Nearby distances (−1, 0, +1) each get their own bucket; farther distances share buckets that widen logarithmically with distance, and everything past a maximum distance lands in the same final bucket. The biases are shared across all layers but learned separately per attention head.

This costs only one scalar per bucket per head, and because distant offsets share buckets, the same small table still applies to sequences longer than those seen in training.
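Here is a rough sketch of the bucketing idea for a single head. It is a simplified illustration, not the reference T5 code: the function names, the default values `num_buckets=32` and `max_distance=128`, and which sign goes to which half of the table are all assumptions made for this example.

```python
import numpy as np

def bucket(rel, num_buckets=32, max_distance=128):
    """Map signed distances to bucket ids in [0, num_buckets)."""
    rel = np.asarray(rel)
    half = num_buckets // 2                  # half of the buckets for each sign
    sign_offset = np.where(rel > 0, half, 0)
    dist = np.abs(rel)
    max_exact = half // 2                    # distances below this get their own bucket
    safe = np.maximum(dist, 1)               # avoid log(0); result unused for small dist
    log_bucket = max_exact + (
        np.log(safe / max_exact) / np.log(max_distance / max_exact) * (half - max_exact)
    ).astype(int)
    log_bucket = np.minimum(log_bucket, half - 1)   # everything far away shares the last bucket
    return sign_offset + np.where(dist < max_exact, dist, log_bucket)

def t5_scores(q, key, bias_table):
    """score(i, j) = q_i . k_j / sqrt(d_k) + b(i - j), for one head."""
    n, d_k = q.shape
    pos = np.arange(n)
    rel = pos[:, None] - pos[None, :]        # rel[i, j] = i - j
    bias = bias_table[bucket(rel, num_buckets=len(bias_table))]
    return q @ key.T / np.sqrt(d_k) + bias

rng = np.random.default_rng(0)
n, d_k = 10, 8
q = rng.normal(size=(n, d_k))
key = rng.normal(size=(n, d_k))
bias_table = rng.normal(size=32)             # stand-in for one head's learned scalars
scores = t5_scores(q, key, bias_table)
print(bucket(np.array([-1, 0, 1, -8, -20, -500])))
```

With these defaults the whole positional table for a head is 32 numbers, regardless of sequence length.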

Why Relative PE Generalises Better

Absolute PE puts a token at “position 42”. If training sequences were at most 64 tokens long, the model learned what position 42 means. But position 200? It never saw that index during training, and a learned position table has no entry for it at all.

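Relative PE never mentions global positions. It only says “these two tokens are 5 apart.” As long as the model has seen pairs 5 apart before (which it almost certainly has), that signal is familiar at any sequence length. With clipping or bucketing, even very large distances map to an entry the model has already trained.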
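A quick check of this argument, using the lengths from the example (64 at training time, 200 at test time) and an assumed clipping distance of 16: every clipped distance needed for the longer sequence already occurs in the shorter one, whereas most of the absolute indices do not. The helper name `clipped_distances` is made up for this sketch.

```python
import numpy as np

def clipped_distances(n, k):
    """All pairwise distances clip(i - j, -k, k) for a sequence of length n."""
    pos = np.arange(n)
    return np.clip(pos[:, None] - pos[None, :], -k, k)

k = 16
train_len, test_len = 64, 200

# Relative view: the set of table entries needed at test time.
seen = set(np.unique(clipped_distances(train_len, k)).tolist())
needed = set(np.unique(clipped_distances(test_len, k)).tolist())
print(needed <= seen)                # True: nothing new is required

# Absolute view: indices 64..199 were never seen in training.
abs_seen = set(range(train_len))
abs_needed = set(range(test_len))
print(len(abs_needed - abs_seen))    # 136
```

This only shows that the positional lookups stay in familiar territory; how well attention quality holds up on much longer inputs is a separate empirical question.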

✅ Key Takeaways

Relative PE encodes the gap between pairs of tokens, not their absolute position.
Shaw et al. (2018) add a learned vector term to the QK dot-product; T5 adds a simpler scalar bias.
Relative PE tends to generalise better to longer sequences, because the model never meets an "unseen absolute position".
T5-style relative biases are lightweight (one small table of scalars per head) and used across Flan-T5, Switch Transformer, and more.

Alessio Borgi