Learned Positional Encodings: Data-Driven Position

Published: May 26, 2026

TL;DR: Learned PE keeps a trainable embedding matrix where row i is the position vector for position i. It's flexible and often matches or slightly beats sinusoidal PE on benchmark tasks, but the table has no row for any position past its maximum length, so it can't generalise to sequences longer than those seen during training.

Trade-off: learned absolute embeddings are flexible and easy to train, but they tie the model to the position range it has actually seen.

Intuition First: Position as a Word

Think of each position index as a separate “token” in its own mini-vocabulary. Just as a word embedding table has one row per word, a position embedding table has one row per position slot. During training, gradient descent shapes those rows into whatever vectors are most useful for the task.

The result may look nothing like sinusoidal waves — the model is free to encode position however it finds helpful, including non-monotonic patterns.
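To make "gradient descent shapes those rows" concrete, here is a toy sketch of a single SGD step applied to the position table alone. Everything in it is an illustrative assumption: the squared-error loss, the learning rate, the target vectors, and the helper name `sgd_step_on_positions`. A real model backpropagates a task loss through the whole network. The one point it relies on is that, since each input is the word vector plus the position row, the gradient with respect to a position row equals the gradient with respect to the input at that position.

```python
def sgd_step_on_positions(pos_table, word_vectors, targets, lr):
    """One SGD step on loss = sum over pos, k of (word[pos][k] + E[pos][k] - target[pos][k])**2.

    Only rows for positions that occur in the sequence receive a gradient.
    """
    for pos, (word_vec, target) in enumerate(zip(word_vectors, targets)):
        for k in range(len(word_vec)):
            residual = word_vec[k] + pos_table[pos][k] - target[k]
            grad = 2.0 * residual  # d loss / d E[pos][k]
            pos_table[pos][k] -= lr * grad

# Hypothetical toy setup: 4 position slots, 2-dimensional vectors, a 2-token sequence.
E = [[0.0, 0.0] for _ in range(4)]
word_vectors = [[1.0, 0.0], [0.0, 1.0]]
targets = [[1.5, 0.0], [0.0, 0.5]]

sgd_step_on_positions(E, word_vectors, targets, lr=0.1)
```

After the step, rows 0 and 1 have moved toward their targets, while rows 2 and 3 are untouched because no token occupied those positions.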

The Simplest Possible Idea

Word embeddings map each token in the vocabulary to a learned vector. Learned PE does exactly the same thing for positions.

You create an embedding matrix E of shape [max_length × d_model]. During training, the rows of E are updated alongside all other model parameters via backpropagation. In both training and inference, you look up the row matching each token’s position and add it to that token’s word embedding.

input[pos] = word_embedding(token[pos]) + E[pos]

That’s it. No formula, no frequencies — just a trainable lookup table.
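The lookup-and-add step can be sketched in a few lines of plain Python. This is a minimal illustration rather than any library's implementation: the helper names `make_table` and `embed_sequence`, the toy sizes, and the random initialisation are all assumptions chosen for readability. A real model would use a framework embedding layer whose weights are trained.

```python
import random

def make_table(num_rows, d_model, seed):
    """Return a num_rows x d_model table of small random values (stand-in for trained weights)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(d_model)] for _ in range(num_rows)]

def embed_sequence(token_ids, word_table, pos_table):
    """Compute input[pos] = word_embedding(token[pos]) + E[pos] for every position."""
    if len(token_ids) > len(pos_table):
        raise ValueError("sequence is longer than the position table")
    return [
        [w + p for w, p in zip(word_table[tok], pos_table[pos])]
        for pos, tok in enumerate(token_ids)
    ]

# Hypothetical toy sizes: 10-word vocabulary, 8 position slots, 4-dimensional vectors.
VOCAB_SIZE, MAX_LENGTH, D_MODEL = 10, 8, 4
word_table = make_table(VOCAB_SIZE, D_MODEL, seed=0)
E = make_table(MAX_LENGTH, D_MODEL, seed=1)

# The same token (id 3) appears at positions 0 and 2 and gets a different input vector each time.
inputs = embed_sequence([3, 7, 3], word_table, E)
```

Here `inputs[0]` and `inputs[2]` share the same word vector but differ by their position rows, which is the only way the model can tell the two occurrences apart.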

Figure 1: Learned PE is a simple lookup table trained end-to-end. Row i is the position vector for position i. Sequences longer than the table length cannot be handled.

Who Uses It?

BERT (2018): 512-position limit, learned embeddings. One of the most influential NLP models of its era.
GPT-1 (2018): 512 positions, learned.
GPT-2 (2019): 1024 positions, learned.
ViT (2020): Image patches are treated as tokens and given learned position embeddings (1D by default; 2D variants were also tried).

Pros and Cons

✅ Advantages

Flexible — learns what works best for the data
Simple to implement (one embedding layer)
Often matches or slightly beats sinusoidal on standard benchmarks
The model can shape position representations to the task

❌ Disadvantages

Cannot generalise beyond the training length
Adds parameters proportional to max sequence length
The last positions (for example, row 511 of a 512-row table) might be poorly trained if few training examples are that long
Less interpretable than a fixed formula
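Two of the disadvantages above are easy to quantify. The sketch below computes the size of a position table and counts how many training sequences actually reach each position. The helper names and the length distribution are hypothetical and chosen only to show the effect. The 512 × 768 figure uses the BERT-base hidden size.

```python
def position_parameter_count(max_length, d_model):
    """Number of trainable values in a [max_length x d_model] position table."""
    return max_length * d_model

def updates_per_position(sequence_lengths, max_length):
    """Count how many training sequences reach each position index."""
    counts = [0] * max_length
    for length in sequence_lengths:
        for pos in range(min(length, max_length)):
            counts[pos] += 1
    return counts

# 512 positions with d_model = 768 (the BERT-base hidden size).
bert_base_params = position_parameter_count(512, 768)

# Hypothetical length distribution: many short sequences, few long ones.
lengths = [4] * 90 + [8] * 9 + [16] * 1
coverage = updates_per_position(lengths, max_length=16)
```

With these toy numbers, positions 0 to 3 are reached by all 100 sequences, positions 4 to 7 by 10, and positions 8 to 15 by a single sequence, so the last rows of the table receive far fewer gradient updates than the first ones.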

Sinusoidal vs. Learned: Which Is Better?

The original Transformer paper tested both and reported that the two versions “produced nearly identical results”. The key distinction is use case:

If your sequences are bounded and short → learned PE is fine.
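If you need to handle sequences longer than those seen in training → sinusoidal, RoPE, or ALiBi are better choices, because they are defined for any position (how well quality holds up at longer lengths still varies by method).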
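The difference between the two cases comes down to what happens at a position the model has never seen. As a rough sketch, the code below contrasts a fixed sinusoidal vector, computed with the standard sine/cosine formula from the original Transformer paper, with a lookup in a learned table. The helper names, the toy sizes, and the all-zero placeholder table are assumptions for illustration only.

```python
import math

def sinusoidal_position(pos, d_model):
    """Fixed sinusoidal vector for any non-negative position (d_model assumed even)."""
    vec = []
    for i in range(0, d_model, 2):
        angle = pos / (10000 ** (i / d_model))
        vec.extend([math.sin(angle), math.cos(angle)])
    return vec

def learned_position(pos, table):
    """Row lookup in a learned table; fails outside the table's range."""
    if not 0 <= pos < len(table):
        raise IndexError(f"position {pos} is outside the table (size {len(table)})")
    return table[pos]

MAX_LENGTH, D_MODEL = 8, 4
table = [[0.0] * D_MODEL for _ in range(MAX_LENGTH)]  # placeholder for trained rows

far = sinusoidal_position(1000, D_MODEL)  # a vector exists for any position
try:
    learned_position(1000, table)
    lookup_failed = False
except IndexError:
    lookup_failed = True  # the learned table simply has no row 1000
```

Being able to compute a vector for position 1000 does not by itself guarantee that a model behaves well there. It only removes the hard limit that a fixed-size table imposes.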

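Most modern large-scale LLMs have moved away from both in favour of RoPE or ALiBi. Neither relies on a learned position table: both encode relative position directly inside the attention computation, which tends to hold up better on longer contexts.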

✅ Key Takeaways

Learned PE is a trainable embedding table: one row per position, trained end-to-end.
Used in BERT, GPT-1/2, and early ViT — simple and effective for bounded-length tasks.
The main weakness: no generalisation beyond the maximum training length.
Slightly more expressive than sinusoidal, but modern LLMs prefer RoPE or ALiBi for long contexts.

Alessio Borgi
