Learned Positional Encodings: Data-Driven Position
The Simplest Possible Idea
Word embeddings map each token in the vocabulary to a learned vector. Learned PE does exactly the same thing for positions.
You create an embedding matrix E of shape [max_length × d_model]. E is trained alongside all other model parameters via backpropagation. In every forward pass, during training and inference alike, you look up the row matching each token’s position and add it to the word embedding.
input[pos] = word_embedding(token[pos]) + E[pos]
That’s it. No formula, no frequencies — just a trainable lookup table.
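The lookup-and-add step can be sketched in a few lines of NumPy. This is a minimal illustration rather than any particular model's implementation: the toy sizes, the random initialisation, and the helper name `embed` are assumptions made for the example, and in a real model both tables would be trainable parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes chosen for illustration only.
vocab_size, max_length, d_model = 100, 16, 8

# Stand-ins for trainable parameters (randomly initialised here).
word_embedding = rng.normal(scale=0.02, size=(vocab_size, d_model))
E = rng.normal(scale=0.02, size=(max_length, d_model))  # one row per position


def embed(token_ids):
    """Return word embedding + position embedding for one sequence."""
    token_ids = np.asarray(token_ids)
    seq_len = token_ids.shape[0]
    if seq_len > max_length:
        raise ValueError(f"sequence length {seq_len} exceeds max_length {max_length}")
    positions = np.arange(seq_len)  # 0, 1, ..., seq_len - 1
    return word_embedding[token_ids] + E[positions]


x = embed([5, 42, 42, 7])
print(x.shape)  # (4, 8)
```

Token 42 appears twice, at positions 1 and 2. Its two input vectors differ only by E[1] - E[2], which is exactly the signal the model uses to tell the occurrences apart.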
Who Uses It?
- BERT (2018): 512 position limit, learned embeddings. One of the most influential NLP models of its era.
- GPT-1 (2018): 512 positions, learned.
- GPT-2 (2019): 1024 positions, learned.
- ViT (2020): image patches are treated as tokens, with learned position embeddings (1D by default; 2D variants were also tried).
Pros and Cons
✅ Advantages
- Flexible — learns what works best for the data
- Simple to implement (one embedding layer)
- Often matches or slightly beats sinusoidal on standard benchmarks
- The model can shape position representations to the task
❌ Disadvantages
- Cannot generalise beyond the training length: positions past max_length have no row in the table
- Adds parameters proportional to max sequence length
- Late positions (for example, the last rows of a 512-row table) might be poorly trained if few training examples are that long
- Less interpretable than a fixed formula
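Three of the disadvantages above are easy to make concrete. The sketch below is illustrative only: `position_table_params` is a hypothetical helper, d_model = 768 is assumed as a typical base-model width, and the uniform length distribution is invented for the example.

```python
import numpy as np


def position_table_params(max_length, d_model):
    """Number of parameters in a learned position table."""
    return max_length * d_model


# Parameter cost grows linearly with the maximum length.
bert_like = position_table_params(512, 768)    # 512 positions, d_model = 768
gpt2_like = position_table_params(1024, 768)   # 1024 positions, d_model = 768
print(bert_like, gpt2_like)  # 393216 786432

# Hard length limit: a 512-row table has rows 0..511 and nothing else.
E = np.zeros((512, 768))
try:
    E[512]
    out_of_range_failed = False
except IndexError:
    out_of_range_failed = True
print(out_of_range_failed)  # True

# Undertrained late rows: row `pos` only receives a gradient from sequences
# longer than `pos`. The length distribution below is hypothetical.
rng = np.random.default_rng(0)
lengths = rng.integers(1, 513, size=10_000)  # lengths in 1..512
updates_per_row = np.array([(lengths > pos).sum() for pos in range(512)])
print(updates_per_row[0], updates_per_row[-1])
```

Row 0 is touched by every sequence, while the final row is touched only by sequences of the full length. Real corpora are usually skewed towards short sequences, so the imbalance tends to be larger than this uniform toy distribution suggests.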
Sinusoidal vs. Learned: Which Is Better?
The original Transformer paper tested both and found that the two versions produced “nearly identical results”. The key distinction is use case:
- If your sequences are bounded and short → learned PE is fine.
- If you need to handle sequences longer than those seen in training → sinusoidal, RoPE, or ALiBi are better candidates, since they are defined for any position.
Most modern large-scale LLMs have moved away from both in favour of RoPE or ALiBi. Neither is a learned table: both use fixed formulas that encode relative position inside attention, and both generally cope better with long contexts.
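The practical difference between a table and a formula can be sketched as follows. The function implements the sinusoidal encoding from the original Transformer paper, PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)); the name `sinusoidal_pe` and the assumption that d_model is even are choices made for this example.

```python
import numpy as np


def sinusoidal_pe(positions, d_model):
    """Sinusoidal position encodings for arbitrary positions (d_model assumed even)."""
    pos = np.asarray(positions, dtype=np.float64)[:, None]  # shape (n, 1)
    even_dims = np.arange(0, d_model, 2)[None, :]           # 0, 2, 4, ...
    angles = pos / np.power(10000.0, even_dims / d_model)   # shape (n, d_model / 2)
    pe = np.zeros((pos.shape[0], d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimensions
    pe[:, 1::2] = np.cos(angles)  # odd dimensions
    return pe


# Any position can be evaluated, including ones far beyond a 512-row table.
pe = sinusoidal_pe([0, 511, 100_000], d_model=8)
print(pe.shape)  # (3, 8)
```

Being computable at position 100,000 is not the same as the model behaving well there. The formula merely removes the hard limit that a lookup table imposes; how well a trained model uses unseen positions is a separate, empirical question.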
✅ Key Takeaways
- Learned PE is a trainable embedding table: one row per position, trained end-to-end.
- Used in BERT, GPT-1/2, and early ViT — simple and effective for bounded-length tasks.
- The main weakness: no generalisation beyond the maximum training length.
- More flexible than sinusoidal, with comparable benchmark results, but modern LLMs prefer RoPE or ALiBi for long contexts.
