Position Interpolation: Extending RoPE with Minimal Fine-Tuning


Published: May 29, 2026

TL;DR: Position Interpolation (PI) stretches RoPE to longer contexts by compressing large positions back into the range seen during training. It is simple, works surprisingly well, and usually needs only light fine-tuning. Conceptually, it is the baseline long-context RoPE fix that later methods such as NTK scaling and YaRN improved upon.

Paper: "Extending Context Window of Large Language Models via Positional Interpolation" · arXiv:2306.15595
Authors: Shouyuan Chen, Sherman Wong, Liangjian Chen, Yuandong Tian
Venue: arXiv 2023

Paper preview: first page of "Extending Context Window of Large Language Models via Positional Interpolation" (Chen et al., 2023).

Figure 1 — Position Interpolation extends a RoPE model by squeezing larger positions back into the positional range seen during training. The architecture stays the same; the trick is to remap coordinates before applying the rotary transform. Source: [1].

The central idea: do not ask the model to handle raw positions it has never seen. Instead, remap those positions into a compressed coordinate system that still looks familiar to the original RoPE frequencies.

Worked Example: Interpolating from 4k to 16k

Model: LLaMA-2 7B, trained at L_train = 4096, target L_target = 16384 (4× extension).

Without Position Interpolation — naive extrapolation:

Token at position 5000: RoPE angle for dim i=0 = 5000 × θ₀ = 5000 × 1.0 = 5000 radians
During training the model only saw positions 0 to 4095, so the raw angle for this dimension always stayed below 4096 radians
The positional pattern for this token is out-of-distribution, and attention quality degrades sharply

With Position Interpolation:

Rescale: pos_new = 5000 × (4096 / 16384) = 5000 × 0.25 = 1250
RoPE angle for dim i=0 = 1250 × 1.0 = 1250 radians
The model saw angles in [0, 4096) during training, so 1250 is well within this range ✓
The last token, at position 16383, maps to 16383 × 0.25 = 4095.75, which is still inside the trained range ✓

The cost: positions that were 1 apart (relative angle = θ) now look like they are 0.25 apart (relative angle = 0.25 × θ). The model’s learned sense of “adjacent” versus “nearby” is compressed. A short fine-tuning run (on the order of 1000 steps, as reported in [1]) lets it readapt its attention patterns to the new compressed geometry.
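The arithmetic of the worked example can be reproduced in a few lines. This is a minimal sketch: the function name `interpolate_position` and the constants are illustrative choices, not part of the paper's code.

```python
L_TRAIN = 4096     # context length seen during training (from the example)
L_TARGET = 16384   # desired context length (4x extension)


def interpolate_position(pos, l_train=L_TRAIN, l_target=L_TARGET):
    """Linearly rescale a position index into the trained range [0, l_train)."""
    return pos * (l_train / l_target)


theta_0 = 1.0  # rotary frequency of dimension pair i = 0

for pos in (5000, 16383):
    new_pos = interpolate_position(pos)
    # Angle fed to RoPE for dim i = 0, without and with interpolation.
    print(pos, pos * theta_0, new_pos * theta_0)

# Relative angle between adjacent tokens shrinks by the same factor.
adjacent_gap = interpolate_position(1) - interpolate_position(0)
print(adjacent_gap)  # 0.25
```

Position 5000 maps to 1250.0 and position 16383 maps to 4095.75, matching the numbers above.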

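High-frequency degradation: PI scales every dimension by the same factor. For the highest-frequency dimension (i=0, θ₀ = 1), adjacent tokens are now separated by 0.25 radians instead of 1, so the fine-grained signal that distinguishes adjacent from nearby tokens is compressed 4×. At the other end, the lowest-frequency dimension (i=63 for a head dimension of 128, θ₆₃ = 10000^(−126/128) ≈ 1/8660) moves only 0.25/8660 ≈ 0.0000289 radians per token after interpolation. Uniform compression treats both ends alike, and the loss of high-frequency resolution is what NTK-aware scaling and YaRN later improved upon.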
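To see how the per-token angle varies across dimensions, the sketch below lists the rotary frequencies under the standard RoPE convention θ_i = 10000^(−2i/d). The head dimension d = 128 and the helper name `rope_frequencies` are assumptions made for illustration.

```python
def rope_frequencies(head_dim=128, base=10000.0):
    """Return theta_i = base^(-2i/d) for each rotary pair i = 0 .. d/2 - 1."""
    return [base ** (-2 * i / head_dim) for i in range(head_dim // 2)]


freqs = rope_frequencies()
scale = 4096 / 16384  # PI compression factor from the worked example

for i in (0, 31, 63):
    per_token = freqs[i]              # angle between adjacent tokens, plain RoPE
    per_token_pi = freqs[i] * scale   # same quantity after Position Interpolation
    print(f"i={i:2d}  theta={per_token:.3e}  after PI={per_token_pi:.3e}")

# i = 0 is the fastest-rotating pair; i = 63 is the slowest and does not
# complete a full turn within 4096 trained positions.
print(freqs[63] * 4095 < 6.283185307179586)  # True
```

Under this convention the index i = 0 carries the highest frequency and i = 63 the lowest, with 1/θ₆₃ close to 8660.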

Why It Was Such a Big Deal

Once RoPE-based LLMs became standard, the obvious next question was: how do we make them handle longer context without retraining from scratch?

Position Interpolation gave one of the first practical answers. Instead of changing the attention mechanism or inventing a new positional encoding, it simply rescales positions:

\[ \text{pos}_{\text{new}} = \text{pos} \cdot \frac{L_{\text{train}}}{L_{\text{target}}} \]

If the original model was trained up to \(L_{\text{train}}\) and you want to run it at \(L_{\text{target}}\), you compress all coordinates so the rotary angles remain inside a more familiar regime.
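As a rough illustration of where the rescaling enters, here is a pure-Python sketch of a rotary transform with an optional `scale` argument. The function names, the toy 8-dimensional vectors, and the interleaved pair layout are assumptions made for this example; real implementations differ in layout and run on tensors.

```python
import math


def rope_rotate(x, pos, scale=1.0, base=10000.0):
    """Rotate consecutive pairs of x by RoPE angles at position `pos`.

    scale < 1 applies Position Interpolation: the position is multiplied
    by scale before the angles are computed. Nothing else changes.
    """
    d = len(x)
    out = []
    for i in range(d // 2):
        angle = (pos * scale) * base ** (-2 * i / d)
        c, s = math.cos(angle), math.sin(angle)
        a, b = x[2 * i], x[2 * i + 1]
        out += [a * c - b * s, a * s + b * c]
    return out


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


q = [0.3, -1.2, 0.7, 0.5, -0.4, 0.9, 1.1, -0.6]   # toy query
k = [0.8, 0.1, -0.5, 1.3, 0.2, -0.7, 0.4, 1.0]    # toy key

# Positions 8000 and 7996 at 4x extension behave like 2000 and 1999.
s_pi = dot(rope_rotate(q, 8000, scale=0.25), rope_rotate(k, 7996, scale=0.25))
s_plain = dot(rope_rotate(q, 2000), rope_rotate(k, 1999))
# RoPE scores depend only on the offset, so this also matches offset 1.
s_rel = dot(rope_rotate(q, 1), rope_rotate(k, 0))
print(s_pi, s_plain, s_rel)
```

The three scores agree up to floating-point error: a gap of 4 tokens in the extended model looks like a gap of 1 token in the original model.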

What This Changes in Practice

RoPE normally rotates queries and keys by an angle proportional to position. If you double or quadruple context length, those angles can move into regimes the model never learned to interpret.

Position Interpolation avoids that by saying:

the model may read a longer sequence, but the positional coordinates fed to RoPE should move more slowly.

So the token sequence becomes longer, but the positional trajectory through rotary space becomes denser and less extreme.

Why Fine-Tuning Still Matters

PI is much better than naive extrapolation, but it is not magic. Compressing positions changes the geometry of how nearby and far-away tokens are separated. The model usually benefits from a short adaptation phase so it can relearn how to use that modified geometry.
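One way to make the change in geometry concrete is the usual complex-number view of RoPE. The notation here is an assumption introduced for this sketch: each rotary pair of the query and key is written as a complex number \(q_j\) or \(k_j\), and \(s = L_{\text{train}} / L_{\text{target}}\). With Position Interpolation, the attention score between positions \(m\) and \(n\) becomes

\[ \text{score}(m, n) = \sum_{j} \mathrm{Re}\left[ q_j \, \overline{k_j} \, e^{\mathrm{i}\, s \,(m - n)\, \theta_j} \right] \]

The score still depends only on the offset \(m - n\), but that offset is multiplied by \(s\). For a 4× extension, \(s = 0.25\), so every learned attention pattern that was tuned to an offset of 1 is now triggered by an offset of 4. Fine-tuning gives the model a chance to recalibrate those patterns.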

That is why Position Interpolation is often described as:

simple
effective
cheap to adapt

but not fully “free” in the way NTK-aware scaling tries to be.

How It Fits in the RoPE Family

Position Interpolation is the baseline long-context RoPE extension recipe. Later methods can be read as refinements:

NTK-aware scaling changes frequencies instead of compressing positions directly
YaRN mixes interpolation and extrapolation across frequency bands
LongRoPE searches for dimension-wise rescaling schedules
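The difference between the first two approaches shows up when comparing per-dimension frequencies. The sketch below uses the commonly cited NTK-aware rule of enlarging the RoPE base to base × factor^(d/(d−2)). That rule, the head dimension of 128, and all function names are assumptions for illustration, not details taken from this post.

```python
def rope_freqs(head_dim=128, base=10000.0):
    """Plain RoPE frequencies theta_i = base^(-2i/d)."""
    return [base ** (-2 * i / head_dim) for i in range(head_dim // 2)]


def pi_freqs(factor, head_dim=128, base=10000.0):
    """Position Interpolation: equivalent to dividing every frequency by factor."""
    return [f / factor for f in rope_freqs(head_dim, base)]


def ntk_freqs(factor, head_dim=128, base=10000.0):
    """NTK-aware scaling (assumed rule): keep positions, enlarge the base."""
    new_base = base * factor ** (head_dim / (head_dim - 2))
    return rope_freqs(head_dim, new_base)


factor = 4.0  # L_target / L_train
orig, pi, ntk = rope_freqs(), pi_freqs(factor), ntk_freqs(factor)

for i in (0, 32, 63):
    # Ratio of the new frequency to the original one for rotary pair i.
    print(i, round(pi[i] / orig[i], 4), round(ntk[i] / orig[i], 4))
```

PI shrinks every frequency by the same factor of 4. Under the assumed NTK-aware rule, the fastest pair (i = 0) is left untouched and only the slowest pair (i = 63) is slowed by the full factor, which is how it preserves local resolution.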

So PI is worth knowing because it is the conceptual bridge between plain RoPE and the more advanced long-context methods.

When to Use It

PI makes sense when:

you already have a trained RoPE model
you want a longer context quickly
you can afford a short fine-tuning run

It is especially useful as a baseline, because if a fancier method does not clearly beat PI, that method probably is not worth the complexity.

✅ Key Takeaways

Position Interpolation extends RoPE by compressing positions before applying rotary embeddings.
It preserves the architecture and keeps the change local to the positional mechanism.
It usually works well with light fine-tuning, making it a practical context-extension baseline.
Conceptually, it sits right before NTK scaling, YaRN, and LongRoPE in the long-context RoPE story.

References

[1] Chen, S., Wong, S., Chen, L., Tian, Y. (2023). Extending Context Window of Large Language Models via Positional Interpolation. arXiv:2306.15595.
[2] Su, J., Lu, Y., Pan, S., Murtadha, A., Wen, B., Liu, Y. (2021). RoFormer: Enhanced Transformer with Rotary Position Embedding. arXiv:2104.09864.

Alessio Borgi
