Learned Positional Encodings: Data-Driven Position

TL;DR: Learned PE keeps a trainable embedding matrix where row i is the position vector for position i. It's flexible and often matches or slightly beats sinusoidal PE on benchmark tasks, but it has no vector for positions beyond the table's maximum length, so it can't generalise to sequences longer than those seen during training.

The Simplest Possible Idea

Word embeddings map each token in the vocabulary to a learned vector. Learned PE does exactly the same thing for positions.

You create an embedding matrix E of shape [max_length × d_model]. During training, the rows of E are updated alongside all other model parameters via backpropagation. In both training and inference, you look up the row matching the token’s position and add it to the word embedding.

input[pos] = word_embedding(token[pos]) + E[pos]

That’s it. No formula, no frequencies — just a trainable lookup table.
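
The lookup-and-add step can be sketched in a few lines of plain Python. This is a minimal illustration, not any particular library's implementation: the helper names (`make_table`, `embed`), the toy sizes, and the random initialisation are all assumptions made for the example, and the training loop is left out.

```python
import random

def make_table(num_rows, dim, rng):
    """Create a num_rows x dim table of small random values.

    These stand in for trainable weights; a real model would update them
    by backpropagation.
    """
    return [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(num_rows)]

def embed(token_ids, word_table, pos_table):
    """Return word_table[token] + pos_table[pos] for each position."""
    if len(token_ids) > len(pos_table):
        raise ValueError(
            f"sequence length {len(token_ids)} exceeds max_length {len(pos_table)}"
        )
    return [
        [w + p for w, p in zip(word_table[tok], pos_table[pos])]
        for pos, tok in enumerate(token_ids)
    ]

rng = random.Random(0)
vocab_size, max_length, d_model = 10, 4, 8  # toy sizes, chosen for illustration
word_table = make_table(vocab_size, d_model, rng)
E = make_table(max_length, d_model, rng)

# Token 3 appears at positions 0 and 1 and gets a different input vector each time.
x = embed([3, 3, 7], word_table, E)
print(len(x), len(x[0]))  # 3 8
```

Because the same token receives a different sum at each position, the layers above can tell the two occurrences apart. A sequence of more than `max_length` tokens is rejected outright, since there is no row to look up.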

[Figure: Learned Position Embedding Matrix E. Each position (pos = 0, 1, 2, …, T) maps to a trainable embedding vector of d_model dims; the row E[pos] is looked up and added to the word embedding. T is the max length: no row exists for pos = T+1, T+2, …, so the table cannot generalise beyond pos = T.]
Figure 1: Learned PE is a simple lookup table trained end-to-end. Row i is the position vector for position i. Sequences longer than the table length cannot be handled.

Who Uses It?

  • BERT (2018): 512 position limit, learned embeddings. The most influential NLP model of its era.
  • GPT-1 (2018): 512 positions, learned.
  • GPT-2 (2019): 1024 positions, learned.
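  • ViT (2020): Image patches are treated as tokens. The standard model uses learned 1D position embeddings; the paper also tried 2D-aware variants and reported no significant gain.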

Pros and Cons

✅ Advantages

  • Flexible — learns what works best for the data
  • Simple to implement (one embedding layer)
  • Often matches or slightly beats sinusoidal on standard benchmarks
  • The model can shape position representations to the task

❌ Disadvantages

  • Cannot generalise beyond the training length
  • Adds parameters proportional to max sequence length
  • Late positions (such as 511, the last row of a 512-row table) might be poorly trained if few training examples are that long
  • Less interpretable than a fixed formula
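
Two of these disadvantages are easy to put numbers on. The sketch below counts the parameters in a position table and tallies how many training sequences touch each row. The function names and the list of sequence lengths are hypothetical, made up for illustration; only the 512 × 768 shape corresponds to a real configuration (BERT-base).

```python
def position_table_params(max_length, d_model):
    """Number of trainable values in a learned position table."""
    return max_length * d_model

def updates_per_position(sequence_lengths, max_length):
    """Count how many training sequences touch each position row.

    Row E[pos] only receives a gradient from sequences longer than pos.
    """
    counts = [0] * max_length
    for length in sequence_lengths:
        for pos in range(min(length, max_length)):
            counts[pos] += 1
    return counts

# 512 positions x 768 dimensions, as in BERT-base
print(position_table_params(512, 768))  # 393216

# Hypothetical corpus: mostly short sequences, one long one
lengths = [3, 3, 4, 5, 8]
print(updates_per_position(lengths, max_length=8))  # [5, 5, 5, 3, 2, 1, 1, 1]
```

In this toy corpus the first three rows are updated by every sequence, while the last three are updated by only one, which is the sense in which late positions can end up undertrained.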

Sinusoidal vs. Learned: Which Is Better?

The original Transformer paper tested both and found that the two versions produced “nearly identical results”. The key distinction is use case:

  • If your sequences are bounded and short → learned PE is fine.
  • If you need to handle sequences longer than those seen in training → sinusoidal, RoPE, or ALiBi are better choices, though how well each one extrapolates varies.
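
The difference at out-of-range positions can be seen directly. The sketch below implements the sinusoidal formula from the original Transformer paper (sin on even dimensions, cos on odd ones, with a base of 10000) and compares it with a learned table. The function name `sinusoidal_pe`, the toy sizes, and the zero-filled placeholder table are assumptions for the example.

```python
import math

def sinusoidal_pe(pos, d_model):
    """Sinusoidal position vector for a single position.

    Dimension 2i uses sin(pos / 10000^(2i/d_model)); dimension 2i+1 uses cos.
    """
    vec = []
    for i in range(0, d_model, 2):  # i runs over the even dimensions 0, 2, 4, ...
        angle = pos / (10000 ** (i / d_model))
        vec.append(math.sin(angle))
        vec.append(math.cos(angle))
    return vec[:d_model]  # trim the extra value if d_model is odd

max_length, d_model = 4, 8  # toy sizes, chosen for illustration
learned_E = [[0.0] * d_model for _ in range(max_length)]  # placeholder weights

pos = 10  # beyond the learned table
print(len(sinusoidal_pe(pos, d_model)))  # 8

try:
    learned_E[pos]
except IndexError:
    print("no learned row for position", pos)
```

The formula yields a vector for any position, whereas the table simply has nothing at index 10. Having a defined vector is not the same as the model behaving well there: a model trained only on short sequences may still degrade on longer ones.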

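Most modern large-scale LLMs have moved away from both in favour of RoPE or ALiBi. Neither uses a per-position lookup table: both encode relative position directly inside the attention computation, which avoids a hard table limit and tends to handle longer contexts better.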

✅ Key Takeaways

  • Learned PE is a trainable embedding table: one row per position, trained end-to-end.
  • Used in BERT, GPT-1/2, and early ViT — simple and effective for bounded-length tasks.
  • The main weakness: no generalisation beyond the maximum training length.
  • Slightly more expressive than sinusoidal, but modern LLMs prefer RoPE or ALiBi for long contexts.