XPos: Length-Extrapolatable Rotary Embeddings

5 minute read

Published: May 29, 2026

TL;DR: XPos keeps RoPE's rotation idea, but adds a distance-aware scaling factor so attention strength does not drift as positions get farther apart. The goal is simple: preserve RoPE's elegant relative geometry while making it extrapolate more gracefully to longer contexts.

Paper: "A Length-Extrapolatable Transformer" · arXiv:2212.10554
Authors: Yutao Sun, Li Dong, Barun Patra, Shuming Ma, Shaohan Huang, Alon Benhaim, Vishrav Chaudhary, Xia Song, Furu Wei
Venue: ACL 2023

Paper preview: first page of "A Length-Extrapolatable Transformer" (Sun et al., 2022).

Figure 1 — XPos starts from RoPE's rotating-query-and-key picture, but adds a scale correction so long-distance interactions remain numerically better behaved. The aim is not to replace rotary embeddings, but to make them stretch farther before attention quality drifts. Source: [1].

The key intuition: RoPE handles relative position through phase. XPos keeps that phase relation, but also adjusts amplitude so the effective dot products between far-away tokens do not become poorly calibrated.

Intuition First: Rotation Angle vs Dot-Product Magnitude

Standard RoPE encodes position purely through rotation angle. Two tokens at positions m and n produce a dot product that depends on their relative angle (m−n)·θ. Rotations preserve vector length, so RoPE changes only the angle between query and key; it adds no distance-dependent control over the overall scale of the attention logits.
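The relative-angle property can be checked numerically. The sketch below is a minimal illustration, not the paper's code: it stores one 2D feature pair as a complex number, and the names `rope_rotate` and `dot` and the value of `theta` are assumptions made for this example.

```python
import cmath
import math

def rope_rotate(pair: complex, pos: int, theta: float) -> complex:
    """Rotate one 2D feature pair (stored as a complex number) by pos * theta."""
    return pair * cmath.exp(1j * pos * theta)

def dot(a: complex, b: complex) -> float:
    """Real 2D dot product of two pairs stored as complex numbers."""
    return (a * b.conjugate()).real

theta = 0.1  # hypothetical rotation frequency for this pair
q, k = complex(1.0, 2.0), complex(0.5, -1.5)

# Same offset m - n = 3 at two very different absolute positions.
near = dot(rope_rotate(q, 5, theta), rope_rotate(k, 2, theta))
far = dot(rope_rotate(q, 1005, theta), rope_rotate(k, 1002, theta))

# Rotation leaves the vector length unchanged.
norm_before = abs(q)
norm_after = abs(rope_rotate(q, 1005, theta))

print(math.isclose(near, far, abs_tol=1e-9))   # score depends only on the offset
print(math.isclose(norm_before, norm_after))   # length is preserved
```

Both checks should agree up to floating-point error: the score depends only on the offset, and nothing in the rotation changes amplitude as distance grows.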

At short distances, the model has learned what magnitudes to expect. At long distances (beyond training length), the angles are unfamiliar AND the magnitudes can drift, making logits poorly calibrated.

XPos adds a multiplicative envelope α^m to the query and α^{−n} to the key. When the dot product is computed, the two factors combine into α^(m−n), which depends only on the relative offset. With α slightly below 1 and a query at position m attending to an earlier key at position n (so m ≥ n), this factor shrinks smoothly as the distance m−n grows. The model now has a controlled signal for “how far away is this token?” in amplitude, not just in phase.
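As a rough sketch of this envelope on a single 2D pair (using one scalar α, as in the simplified description here), the XPos score should equal the plain RoPE score times α^(m−n). The helper names and the values `theta = 0.1` and `alpha = 0.99` are hypothetical choices for illustration.

```python
import cmath
import math

def rotate(pair: complex, pos: int, theta: float) -> complex:
    """Plain RoPE rotation of one 2D pair stored as a complex number."""
    return pair * cmath.exp(1j * pos * theta)

def xpos_query(pair: complex, pos: int, theta: float, alpha: float) -> complex:
    """Rotate the query pair and scale it by alpha**pos."""
    return (alpha ** pos) * rotate(pair, pos, theta)

def xpos_key(pair: complex, pos: int, theta: float, alpha: float) -> complex:
    """Rotate the key pair and scale it by alpha**(-pos)."""
    return (alpha ** -pos) * rotate(pair, pos, theta)

def dot(a: complex, b: complex) -> float:
    """Real 2D dot product of two pairs stored as complex numbers."""
    return (a * b.conjugate()).real

theta, alpha = 0.1, 0.99  # hypothetical values for illustration
q, k = complex(1.0, 2.0), complex(0.5, -1.5)
m, n = 40, 10             # query position m, earlier key position n

rope_score = dot(rotate(q, m, theta), rotate(k, n, theta))
xpos_score = dot(xpos_query(q, m, theta, alpha), xpos_key(k, n, theta, alpha))
envelope = alpha ** (m - n)

# Shifting both positions by the same amount should not change the score.
shifted_score = dot(xpos_query(q, m + 500, theta, alpha),
                    xpos_key(k, n + 500, theta, alpha))

print(math.isclose(xpos_score, envelope * rope_score, rel_tol=1e-9))
print(math.isclose(xpos_score, shifted_score, rel_tol=1e-9))
```

Note that the individual factors α^m and α^{−n} still depend on absolute position; only their product is relative, which is why the shifted score matches.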

RoPE (left): each position rotates the vector by a different angle but the vector length stays constant — the attention logit magnitude is uncontrolled at long range. XPos (right): position also scales the vector length, so distant tokens produce smoothly decaying dot products regardless of rotation angle — better-calibrated logits at long context.

Why RoPE Still Struggles at Long Range

RoPE is elegant because relative position emerges from rotating the query and key vectors by position-dependent angles. But when sequence length grows far beyond training, those rotations can still become hard for the model to use reliably. The issue is not only phase wrapping. It is also that attention scores at large distances become less well-conditioned.

So the question behind XPos is: can we keep RoPE’s relative-position geometry, but make its long-range behaviour numerically more stable?

The Core Modification

Standard RoPE rotates each 2D query-key pair by an angle that depends on token position. XPos applies the same rotation idea, but introduces a position-dependent scale term. In simplified form:

\[ \tilde{q}_m = \alpha^m R_m q, \qquad \tilde{k}_n = \alpha^{-n} R_n k \]

where:

\(R_m\) is the usual RoPE rotation at position \(m\)
\(\alpha\) is a scale base close to 1, treated here as a single scalar (the paper uses a fixed scale defined per dimension pair)

The important property is that the relative phase structure is preserved, but the magnitude now changes with distance in a controlled way.
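To see why, expand the attention score for one 2D pair using the simplified scalar form above. The only facts assumed are that rotation matrices are orthogonal and that their angles add, so \(R_m^\top R_n = R_{n-m}\):

\[
\langle \tilde{q}_m, \tilde{k}_n \rangle
= (\alpha^m R_m q)^\top (\alpha^{-n} R_n k)
= \alpha^{m-n} \, q^\top R_m^\top R_n k
= \alpha^{m-n} \, q^\top R_{n-m} k
\]

The rotation part \(q^\top R_{n-m} k\) is exactly the plain RoPE score, and the prefactor \(\alpha^{m-n}\) depends only on the offset \(m-n\). Absolute positions cancel in both terms.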

What Problem This Solves

With plain RoPE, extending sequence length can distort the effective distribution of attention logits. XPos tries to counteract that by making long-range interactions decay in a smoother, more stable way.

In practice, that means:

better length extrapolation than plain RoPE in some settings
less brittle long-context behaviour
a small change to the positional mechanism, not a full architectural rewrite
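To get a feel for how smooth this decay is, the sketch below tabulates the envelope α^d at a few distances d and computes the distance at which it halves. The α values and the helper names `envelope` and `half_distance` are hypothetical, chosen only to illustrate the shape of the curve under the scalar simplification.

```python
import math

def envelope(alpha: float, distance: float) -> float:
    """Amplitude factor alpha**(m - n) for a query `distance` tokens after the key."""
    return alpha ** distance

def half_distance(alpha: float) -> float:
    """Distance at which the envelope has dropped to 0.5 (requires 0 < alpha < 1)."""
    return math.log(0.5) / math.log(alpha)

distances = [1, 16, 256, 4096]
table = {a: [envelope(a, d) for d in distances] for a in (0.999, 0.9999)}

for alpha, values in table.items():
    row = ", ".join(f"d={d}: {v:.4f}" for d, v in zip(distances, values))
    print(f"alpha={alpha}: {row} (half at ~{half_distance(alpha):.0f} tokens)")
```

The closer α is to 1, the slower the decay, so the choice of scale controls how far the envelope reaches before distant tokens are strongly down-weighted.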

How to Think About XPos

If RoPE says, “position is a rotation angle,” then XPos says:

position is mostly a rotation angle, but distance should also slightly reweight how strongly those rotated vectors interact.

That extra degree of control is what makes XPos interesting. It is still fundamentally a rotary method, but it acknowledges that angle alone is not always enough when context length grows.

When It Is Useful

XPos is most useful when:

you already like RoPE’s relative-position behaviour
you want better extrapolation without switching to a completely different scheme
you care about long context, but do not want to redesign the attention mechanism

It is less famous than YaRN or LongRoPE in today’s LLM tooling, but conceptually it is one of the cleanest “make RoPE more stable” ideas.

✅ Key Takeaways

XPos is a rotary embedding extension, not a brand-new positional family.
It keeps RoPE's relative rotation structure but adds distance-aware scaling.
The goal is better length extrapolation and better-behaved attention logits at long range.
You can read it as a principled "RoPE, but more stable" design.

References

[1] Sun, Y., Dong, L., Patra, B., Ma, S., Huang, S., Benhaim, A., Chaudhary, V., Song, X., Wei, F. (2022). A Length-Extrapolatable Transformer. arXiv:2212.10554.
[2] Su, J., Lu, Y., Pan, S., Murtadha, A., Wen, B., Liu, Y. (2021). RoFormer: Enhanced Transformer with Rotary Position Embedding. arXiv:2104.09864.

Alessio Borgi