p-RoPE: What Makes Rotary Positional Encodings Useful?

Published: May 29, 2026

TL;DR: This paper argues that RoPE is not mainly useful because it creates distance decay. Instead, trained models exploit high frequencies to build sharp positional attention patterns, while low frequencies are often reused as stable semantic channels. From that observation, the authors propose p-RoPE, which removes the lowest rotary frequencies and can improve performance.

Paper: "Round and Round We Go! What makes Rotary Positional Encodings useful?" · arXiv:2410.06205
Authors: Federico Barbero, Alex Vitvitskyi, Christos Perivolaropoulos, Razvan Pascanu, Petar Veličković
Venue: ICLR 2025

Figure 1 — The p-RoPE proposal comes from a mechanistic reading of RoPE. The authors argue that the highest frequencies are what enable sharp positional attention patterns, while the lowest frequencies are often preferred for semantic usage. Partial RoPE therefore removes those lowest rotary frequencies to preserve more stable semantic channels. Source: [1].

The important conceptual move: this paper is not only proposing a variant. It is also changing the explanation of why RoPE works. The standard story says RoPE helps because attention decays with distance. The paper argues that this is incomplete, and that frequency specialization matters much more.

Worked Example: Frequency Roles in a Trained Model

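Consider a trained LLM (e.g., Gemma 7B) with head dimension d = 256, so d/2 = 128 frequency pairs, indexed i = 0, …, 127. With the usual base of 10000, pair i rotates by θᵢ = 10000^(−2i/d) = 10000^(−i/128) radians per token.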

High-frequency dimensions (i near 0, short wavelength):

θ₀ = 1 radian per token (highest frequency, one full rotation every 2π ≈ 6.3 tokens)
The attention pattern using this dimension can distinguish position 5 from position 6 precisely
Observed in Gemma 7B: heads that perform positional lookup (e.g., attending to the previous token, or tokens exactly 3 positions back) rely heavily on these high-frequency dimensions

Low-frequency dimensions (i near 127, long wavelength):

θ₁₂₇ = 1/10000^(127/128) ≈ 1/9306 (one rotation per roughly 58,000 tokens); for comparison, the mid-band θ₆₄ = 1/10000^(64/128) = 1/100 already needs about 628 tokens per rotation
The lowest frequencies complete far less than one cycle within a typical 2k context, so the angle barely changes
The paper finds: the model repurposes these near-constant dimensions as stable semantic features (the rotation is so slow it behaves almost like no rotation at all)
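
The numbers in this example can be checked with a few lines of NumPy. This is a minimal sketch, assuming the common RoPE parametrization θᵢ = base^(−2i/d) with base 10000; the function name `rope_frequencies` is made up for illustration and is not taken from the paper or from any library.

```python
import numpy as np

def rope_frequencies(head_dim: int, base: float = 10000.0) -> np.ndarray:
    """Angular frequency of each rotary pair, in radians per token.

    Assumes theta_i = base^(-2i / head_dim) for i = 0 .. head_dim/2 - 1.
    """
    pair_index = np.arange(head_dim // 2)
    return base ** (-2.0 * pair_index / head_dim)

theta = rope_frequencies(256)      # 128 pairs for head dimension 256
wavelength = 2 * np.pi / theta     # tokens needed for one full rotation

# Fastest pair, a mid-band pair, and the slowest pair.
for i in (0, 64, 127):
    print(f"i={i:3d}  theta={theta[i]:.6f} rad/token  "
          f"wavelength={wavelength[i]:,.1f} tokens")
```

The wavelength grows geometrically with the pair index, which is why the slowest pairs barely move inside a short context window while the fastest ones separate neighbouring tokens.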

p-RoPE intervention:

Standard RoPE rotates all 128 pairs
p-RoPE: keep only a fraction p of the pairs rotated and drop rotation on the lowest-frequency rest (e.g., p = 0.75 leaves the 32 lowest-frequency pairs, i = 96…127, unrotated)
Those 32 pairs now act as pure semantic channels — no positional interference
The remaining 96 pairs still carry all the positional signal needed
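
The intervention above can be sketched as follows. This is an illustrative implementation under stated assumptions, not the authors' code: it assumes base 10000, an interleaved pair layout (channels 2i and 2i+1 form pair i), and a hypothetical argument `rotated_fraction` giving the share of pairs that stay rotated. Setting a pair's frequency to zero makes its rotation the identity, which is one simple way to "drop" it.

```python
import numpy as np

def p_rope_frequencies(head_dim, rotated_fraction, base=10000.0):
    """Rotary frequencies with the lowest-frequency pairs switched off."""
    n_pairs = head_dim // 2
    theta = base ** (-2.0 * np.arange(n_pairs) / head_dim)
    n_rotated = int(round(rotated_fraction * n_pairs))
    theta[n_rotated:] = 0.0   # zero angle at every position: no rotation
    return theta

def apply_rope(x, positions, theta):
    """Rotate each (even, odd) channel pair of x by position * theta.

    x: (seq_len, head_dim), positions: (seq_len,), theta: (head_dim / 2,)
    """
    angles = positions[:, None] * theta[None, :]
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x_even * cos - x_odd * sin
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 256))
theta = p_rope_frequencies(256, rotated_fraction=0.75)
y = apply_rope(x, np.arange(8), theta)

print(np.count_nonzero(theta))               # number of pairs still rotated
print(np.allclose(y[:, 192:], x[:, 192:]))   # channels of the unrotated pairs
```

With these settings 96 pairs keep rotating and the 64 channels belonging to the 32 slowest pairs pass through unchanged, so whatever the model stores there is independent of token position.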

Why this is non-obvious: you might expect that removing any RoPE dimensions would hurt positional encoding. p-RoPE argues the opposite — for the lowest frequencies, the rotation was so slow as to be nearly useless for positions anyway, but was still polluting the semantic signal with a small, unhelpful rotation.

What the Paper Tries to Explain

RoPE had become the default positional encoding for modern LLMs, but the explanation people repeated was often vague: it supposedly helps by making attention weaken with relative distance.
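
A toy check makes the distance-decay story concrete. The sketch below is in the spirit of the argument rather than a reproduction of the paper's experiments: it assumes head dimension 256 and base 10000, and compares the RoPE attention logit for a constant all-ones query and key against the average logit for random Gaussian queries and keys. The helper `rope_score` and the chosen distances are illustrative.

```python
import numpy as np

HEAD_DIM = 256
theta = 10000.0 ** (-2.0 * np.arange(HEAD_DIM // 2) / HEAD_DIM)

def rope_score(q, k, offset):
    """Dot product of q and k after RoPE, with the key `offset` tokens away."""
    angle = offset * theta
    q_even, q_odd = q[..., 0::2], q[..., 1::2]
    k_even, k_odd = k[..., 0::2], k[..., 1::2]
    # Per pair this is q^T R(angle) k; the pairs are then summed.
    return np.sum((q_even * k_even + q_odd * k_odd) * np.cos(angle)
                  + (q_odd * k_even - q_even * k_odd) * np.sin(angle), axis=-1)

distances = np.array([0, 10, 100, 1000])

# Case 1: identical constant vectors, the setting where decay is visible.
ones = np.ones(HEAD_DIM)
constant_scores = np.array([rope_score(ones, ones, r) for r in distances])

# Case 2: random Gaussian queries and keys, averaged over many draws.
rng = np.random.default_rng(0)
q = rng.standard_normal((2000, HEAD_DIM))
k = rng.standard_normal((2000, HEAD_DIM))
random_scores = np.array([rope_score(q, k, r).mean() for r in distances])

print("constant:", np.round(constant_scores, 1))
print("random:  ", np.round(random_scores, 2))
```

For the constant vectors the logit is largest at distance 0 and smaller at every other distance shown, while for random vectors the average sits near zero at all distances. Decay is therefore a property of particular queries and keys, not something RoPE imposes by itself.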

The authors argue that this is not the full story. They inspect a trained Gemma 7B model and study how RoPE is actually used internally. Their main conclusion is much more interesting:

the highest frequencies are used to build precise, robust positional attention patterns
the lowest frequencies are strongly preferred by the model and seem to function more like semantic carriers

That immediately suggests a design question: if low frequencies are often being repurposed semantically, should all frequencies really be rotated?

The p-RoPE Idea

The answer proposed by the paper is partial RoPE, or p-RoPE. Instead of applying rotary embeddings to the full frequency range, you truncate the lowest-frequency part and keep only the more useful positional bands.

In simplified form:

\[ \text{p-RoPE} = \text{RoPE applied only to a selected subset of frequency channels} \]
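
One way to write this out more explicitly, as a sketch rather than the paper's exact notation: assume base \(b = 10000\), head dimension \(d\), and let \(p \in [0, 1]\) be the fraction of frequency pairs that stay rotated, so that \(p = 1\) is standard RoPE and \(p = 0\) applies no rotation at all. The symbols \(\tilde{\theta}_i\) and \(R(\cdot)\) (a 2D rotation matrix) are introduced here for illustration.

\[ \theta_i = b^{-2i/d}, \qquad \tilde{\theta}_i = \begin{cases} \theta_i, & i < \lfloor p \cdot d/2 \rfloor \\ 0, & \text{otherwise} \end{cases} \qquad i = 0, \dots, d/2 - 1 \]

For a query at position \(m\) and a key at position \(n\), pair \(i\) then contributes

\[ \langle R(m\tilde{\theta}_i)\, q^{(i)},\; R(n\tilde{\theta}_i)\, k^{(i)} \rangle = \langle q^{(i)},\; R((n-m)\tilde{\theta}_i)\, k^{(i)} \rangle \]

to the attention logit. When \(\tilde{\theta}_i = 0\) the rotation is the identity and the contribution reduces to \(\langle q^{(i)}, k^{(i)} \rangle\), which does not depend on \(m\) or \(n\).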

The point is not to make RoPE weaker. It is to make the roles cleaner:

high frequencies stay available for positional circuitry
the lowest-frequency channels are no longer forced through unnecessary rotations

Why This Makes Sense

If the model already wants to use low frequencies as stable semantic features, rotating them may be counterproductive. By not rotating that part of the space, p-RoPE gives the model a cleaner semantic pathway while preserving the sharper positional machinery where it matters most.
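
A small numerical illustration of that argument, under assumptions chosen for the example: a single 2D pair at the slowest frequency of a 256-dimensional head (base 10000), a query and key that match perfectly in that pair, and offsets picked only to show short and very long range. The function `pair_score` is hypothetical.

```python
import numpy as np

def pair_score(q_pair, k_pair, theta, offset):
    """Contribution of one rotary pair to q.k, key `offset` tokens from query."""
    angle = offset * theta
    rotation = np.array([[np.cos(angle), -np.sin(angle)],
                         [np.sin(angle),  np.cos(angle)]])
    return float(q_pair @ rotation @ k_pair)

theta_lowest = 10000.0 ** (-127 / 128)    # slowest pair for d = 256
q_pair = k_pair = np.array([1.0, 0.0])    # a perfectly matching feature

for offset in (0, 2_000, 32_000):
    rotated = pair_score(q_pair, k_pair, theta_lowest, offset)
    unrotated = pair_score(q_pair, k_pair, 0.0, offset)   # p-RoPE style pair
    print(f"offset={offset:6d}  rotated={rotated:+.3f}  unrotated={unrotated:+.3f}")
```

In this toy setting the rotated pair stays close to the unrotated one at a couple of thousand tokens but drifts as the offset grows, and at tens of thousands of tokens the match even changes sign. The unrotated pair gives the same score at every offset, which is the cleaner semantic pathway described above.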

This is why the proposal is interesting even beyond its benchmark gains. It suggests that positional encodings should perhaps be treated less as monolithic blocks and more as frequency-partitioned tools.

What It Adds to the Positional-Encoding Story

This paper is especially valuable because it sits between theory and engineering:

it gives a mechanistic explanation of RoPE behaviour
it derives a practical positional variant from that explanation

So in the broader book sequence, p-RoPE belongs right after RoPE itself and before the long-context extension methods. It is the chapter that asks:

before we extend RoPE, do we even understand which parts of RoPE are doing what?

That makes it one of the most useful “bridge” papers in the series.

Practical Takeaway

You should not think of p-RoPE as “the new default positional encoding.” The more important takeaway is the frequency decomposition insight:

some frequencies are much more useful for explicit positional attention patterns
some are better left to semantic representation

That perspective helps make sense of why methods like FoPE, NTK scaling, YaRN, and LongRoPE all end up reasoning so heavily about frequency bands.

✅ Key Takeaways

This paper argues that RoPE's value is not mainly simple distance decay.
High frequencies help build sharp positional attention patterns.
Low frequencies are often reused as more stable semantic channels.
p-RoPE drops the lowest rotary frequencies to preserve those semantic channels and can improve performance.

References

[1] Barbero, F., Vitvitskyi, A., Perivolaropoulos, C., Pascanu, R., Veličković, P. (2024). Round and Round We Go! What makes Rotary Positional Encodings useful? ICLR 2025. arXiv:2410.06205.
[2] Su, J., Lu, Y., Pan, S., Murtadha, A., Wen, B., Liu, Y. (2021). RoFormer: Enhanced Transformer with Rotary Position Embedding. arXiv:2104.09864.

Alessio Borgi
