Positional Encodings: Why Position Matters

Published: January 04, 2024

TL;DR: Because self-attention is order-agnostic, Transformers need an extra signal to know which token is at which position. Positional encodings (PEs) inject this information, in the simplest case as vectors added to the token embeddings. PE designs differ in whether they encode absolute or relative position, whether they are fixed or learned, and how well they extrapolate to longer sequences.

The Order-Agnostic Problem

Self-attention computes pairwise scores between all tokens. Nothing in the attention equation depends on whether token A comes first or last: every score is computed from token content alone. Shuffle the sentence and the model produces exactly the same output vectors, just with the rows permuted in the same way.

This is catastrophic for language: “dog bites man” and “man bites dog” have opposite meanings.
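To make the shuffling claim concrete, here is a minimal NumPy sketch of single-head self-attention with no positional signal. The function name, the random weights, and the three-token "sentence" are illustrative assumptions, not code from any particular library.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention with no positional signal."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
seq_len, d_model = 3, 8
x = rng.normal(size=(seq_len, d_model))  # random stand-ins for "dog", "bites", "man"
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

perm = np.array([2, 1, 0])               # reorder to "man", "bites", "dog"
out = self_attention(x, w_q, w_k, w_v)
out_shuffled = self_attention(x[perm], w_q, w_k, w_v)

# Shuffling the input rows only shuffles the output rows in the same way.
rows_just_permuted = np.allclose(out[perm], out_shuffled)
print(rows_just_permuted)
```

Each token ends up with the same output vector regardless of where it sits, so nothing downstream can tell the two word orders apart.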

The Solution: Inject Position into the Embedding

The fix is conceptually simple: before the first attention layer, add a position-dependent vector to each token’s embedding.

final_input[pos] = token_embedding[pos] + positional_encoding[pos]

The attention mechanism then sees the mixed vector and can pick up position information from it. Simple. But the choice of what those position vectors are turns out to matter a lot.

Figure 1: Positional encoding is added element-wise to the token embedding before the first attention layer.
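The addition can be sketched in a few lines of NumPy. This example assumes the sinusoidal encoding from the original Transformer as the position vectors and an even `d_model`; the random `token_embedding` matrix is a stand-in for a real embedding lookup.

```python
import numpy as np

def sinusoidal_encoding(seq_len, d_model):
    """Sinusoidal position vectors (original Transformer style); d_model assumed even."""
    positions = np.arange(seq_len)[:, None]                        # (seq_len, 1)
    freqs = 1.0 / 10000 ** (np.arange(0, d_model, 2) / d_model)    # (d_model / 2,)
    angles = positions * freqs                                     # (seq_len, d_model / 2)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions
    pe[:, 1::2] = np.cos(angles)   # odd dimensions
    return pe

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8
token_embedding = rng.normal(size=(seq_len, d_model))
token_embedding[3] = token_embedding[0]   # the same token appears at positions 0 and 3

positional_encoding = sinusoidal_encoding(seq_len, d_model)
final_input = token_embedding + positional_encoding

# The repeated token now has different input vectors at its two positions.
same_vector = np.allclose(final_input[0], final_input[3])
print(same_vector)
```

After the addition, a repeated token no longer looks identical at both positions, which gives attention something to distinguish them by.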

The Landscape of PE Methods

Method	Type	Learnable?	Extrapolates?	Used in
Sinusoidal	Absolute	No	Moderate	Original Transformer (2017)
Learned Absolute	Absolute	Yes	No	BERT, GPT-1, ViT
Relative (Shaw)	Relative	Yes	Yes	Music Transformer
Relative (T5 Bias)	Relative	Yes	Yes	T5, Flan-T5
RoPE	Rotary (Absolute→Relative)	No	Good	LLaMA, Mistral, GPT-NeoX
ALiBi	Attention bias	No	Excellent	BLOOM, MPT

Three Axes to Understand PEs

1. Absolute vs. Relative
Absolute methods assign a vector to each position index (0, 1, 2, …). Relative methods instead encode the distance between two tokens (±1, ±2, …). Relative encodings tend to generalise better across lengths.

2. Fixed vs. Learned
Fixed methods (sinusoidal, ALiBi) use a deterministic formula with no extra parameters. Learned methods (BERT-style, relative biases) train position representations end-to-end. Learned methods are more flexible; fixed methods can be evaluated at any position, so they impose no hard maximum length.

3. Extrapolation
Can the model handle sequences longer than those seen during training? This is the key practical question for LLMs serving long documents. ALiBi and RoPE generally do best here; standard learned absolute PEs have no vector for positions beyond the trained maximum and fail badly.
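The three axes can be illustrated with a short NumPy sketch. The sizes, the variable names, and the `alibi_bias` helper are assumptions made for illustration. The slopes follow the geometric sequence commonly used with ALiBi when the head count is a power of two, and future positions are simply zeroed here rather than masked.

```python
import numpy as np

# Axis 1: absolute indices vs. relative offsets between query i and key j.
seq_len = 5
positions = np.arange(seq_len)                       # absolute: 0, 1, 2, 3, 4
relative = positions[None, :] - positions[:, None]   # relative[i, j] = j - i

# Axis 2: a learned table only has rows for positions seen during training.
max_len, d_model = 4, 8                              # hypothetical training limit
learned_table = np.random.default_rng(0).normal(size=(max_len, d_model))
try:
    learned_table[positions]                         # position 4 has no row
    lookup_ok = True
except IndexError:
    lookup_ok = False

# Axis 3: an ALiBi-style bias is a formula, so any length can be evaluated.
def alibi_bias(seq_len, num_heads):
    """Per-head linear penalty on query-key distance, added to attention scores."""
    slopes = 2.0 ** (-8.0 * np.arange(1, num_heads + 1) / num_heads)
    pos = np.arange(seq_len)
    distance = np.maximum(pos[:, None] - pos[None, :], 0)  # i - j for past keys, else 0
    return -slopes[:, None, None] * distance[None, :, :]   # (heads, seq_len, seq_len)

bias_short = alibi_bias(4, 8)
bias_long = alibi_bias(64, 8)    # longer than the "training" length, still well defined
print(relative[0], lookup_ok, bias_short.shape, bias_long.shape)
```

The lookup into the learned table raises an `IndexError` at the first unseen position, while the formula-based bias is computed for the longer sequence without any change.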

✅ Key Takeaways

Self-attention is order-agnostic; PEs inject position information as vectors added to token embeddings.
The main design axes are: absolute vs. relative, fixed vs. learned, extrapolation capability.
Modern LLMs (LLaMA, Mistral, BLOOM) moved away from sinusoidal PEs toward RoPE and ALiBi.
Each subsequent chapter covers one PE method in depth — start with sinusoidal to understand the origin.

Alessio Borgi