LongRoPE: Extending Context to 2 Million Tokens

TL;DR: LongRoPE extends context by assigning a separate rescaling factor λᵢ to each RoPE dimension pair, found via evolutionary search rather than a closed-form rule. Where YaRN uses a three-zone approximation, LongRoPE searches for the per-dimension factors directly, which enables a 2M-token context with only two short fine-tuning stages.

Pushing Beyond YaRN

YaRN divides RoPE dimensions into three zones (high-freq, mid-freq, low-freq) with a hand-designed blending function. This works well up to ~128k tokens. But at extreme lengths — 512k, 1M, 2M — the approximation breaks down.

LongRoPE (Ding et al., Microsoft Research, 2024) takes a fundamentally different approach: instead of designing the rescaling analytically, it searches for good per-dimension rescaling factors directly.

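In standard RoPE, each dimension pair i rotates with frequency θᵢ = base^(−2i/d), where d is the head dimension and base is typically 10000. To extend context, all of these methods modify the frequencies. LongRoPE generalises them: each dimension pair i gets its own searched rescaling factor λᵢ: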

θᵢ_new = θᵢ / λᵢ

Setting all λᵢ = s (the scale factor) gives linear interpolation. Setting λᵢ via the NTK formula gives NTK scaling. YaRN approximates the optimal λᵢ with a three-zone formula.
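
To make the special cases concrete, here is a minimal NumPy sketch that expresses linear interpolation and NTK scaling as λ vectors. The function names are illustrative, and it assumes the common NTK-aware variant that replaces base with base · s^(d/(d−2)), which works out to λᵢ = s^(2i/(d−2)).

```python
import numpy as np

def rope_frequencies(head_dim, base=10000.0):
    """Standard RoPE frequencies theta_i = base^(-2i/d) for i = 0 .. d/2 - 1."""
    i = np.arange(head_dim // 2)
    return base ** (-2.0 * i / head_dim)

def linear_lambda(head_dim, scale):
    """Linear interpolation: every dimension pair gets the same factor s."""
    return np.full(head_dim // 2, float(scale))

def ntk_lambda(head_dim, scale):
    """NTK-aware scaling written as per-dimension factors (assumed variant:
    base -> base * s^(d/(d-2)), which is equivalent to lambda_i = s^(2i/(d-2)))."""
    i = np.arange(head_dim // 2)
    return float(scale) ** (2.0 * i / (head_dim - 2))

def rescale(theta, lam):
    """theta_new_i = theta_i / lambda_i."""
    return theta / lam

theta = rope_frequencies(head_dim=128)
theta_linear = rescale(theta, linear_lambda(128, scale=8))
theta_ntk = rescale(theta, ntk_lambda(128, scale=8))
```

Under this NTK variant the highest-frequency pair is left untouched (λ₀ = 1) and the lowest-frequency pair is scaled by the full factor s, whereas linear interpolation scales every pair by s.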

LongRoPE instead searches for the vector λ = [λ₀, λ₁, …, λ_{d/2-1}] directly with an evolutionary search algorithm, in which a population of candidate vectors is refined through repeated mutation and selection.

Objective function: on a set of long-context validation documents, evaluate perplexity for a given λ vector.
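
For reference, perplexity is the exponential of the mean negative log-likelihood per token, so lower is better. A minimal sketch, assuming the per-token natural-log probabilities have already been produced by the model (the function name is illustrative):

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp(-(1/N) * sum of natural-log token probabilities)."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

# A model that assigns probability 0.25 to every token is as uncertain
# as a uniform choice among four options.
ppl = perplexity([math.log(0.25)] * 4)
```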

Search procedure:

  1. Initialise λ using YaRN’s formula as a warm start
  2. Run evolutionary search (population of candidate λ vectors)
  3. For each candidate: compute attention scores with modified RoPE, measure perplexity
  4. Select best candidates, apply mutations, repeat
  5. Return λ that achieves minimum perplexity on long sequences

The search is done with the frozen original model weights — no gradient updates. Only λ is optimised (it is not a learned parameter in the usual sense; it is found by black-box search).
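
The search loop can be sketched in a few lines of NumPy. This is a toy illustration under stated assumptions, not the authors' implementation: `mutate`, `evolutionary_search`, and `toy_objective` are hypothetical names, the objective is a made-up stand-in for long-context perplexity (the real one requires running the frozen model), and the warm start is a uniform vector rather than YaRN's formula.

```python
import numpy as np

def mutate(lam, rng, sigma, lo, hi):
    """Perturb a candidate with multiplicative log-normal noise, then clip to bounds."""
    child = lam * np.exp(sigma * rng.standard_normal(lam.shape))
    return np.clip(child, lo, hi)

def evolutionary_search(objective, init, lo, hi, iters=50, pop_size=32,
                        n_parents=8, sigma=0.1, seed=0):
    """Black-box minimisation of objective(lam); no gradients are used."""
    rng = np.random.default_rng(seed)
    init = np.clip(np.asarray(init, dtype=float), lo, hi)
    # Initial population: the warm start plus mutated copies of it.
    population = [init] + [mutate(init, rng, sigma, lo, hi)
                           for _ in range(pop_size - 1)]
    for _ in range(iters):
        population.sort(key=objective)       # lower score is better
        parents = population[:n_parents]     # keep the best candidates unchanged
        children = []
        while len(parents) + len(children) < pop_size:
            parent = parents[rng.integers(n_parents)]
            children.append(mutate(parent, rng, sigma, lo, hi))
        population = parents + children
    return min(population, key=objective)

# Made-up "ideal" factors, used only so the toy objective has a minimum.
target = np.array([1.0, 1.0, 1.5, 2.0, 3.0, 5.0, 7.0, 8.0])

def toy_objective(lam):
    """Stand-in for perplexity: squared log-distance to the made-up target."""
    return float(np.sum((np.log(lam) - np.log(target)) ** 2))

warm_start = np.full(8, 8.0)  # uniform factors, as in linear interpolation
best = evolutionary_search(toy_objective, warm_start, lo=1.0, hi=8.0)
```

Because the best candidates are carried over unchanged each generation, the best score never gets worse than that of the warm start. In a real search each call to the objective is a forward pass over long documents, so the population size and the number of generations dominate the cost.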

Why search instead of derive? The optimal per-dimension rescaling is not analytically tractable. It depends on what patterns the model has learned during pre-training, which is different for each model architecture and training distribution. Search directly optimises what matters — perplexity on long sequences — without assumptions about the optimal functional form.

The Two-Stage Training Pipeline

After finding λ via search, LongRoPE uses two short fine-tuning stages:

Stage 1: Extreme extension (e.g., 2M tokens)

  • Apply the searched λ
  • Fine-tune for ~400 steps on long documents (8k–128k sequence length)
  • This adapts the model weights to the new rotary frequencies at maximum context

Stage 2: Short-context recovery

  • The model after Stage 1 performs slightly worse at short contexts
  • Fine-tune with a smaller λ (less aggressive rescaling) for ~200 steps
  • This recovers near-original performance at the original training length

The two-stage approach is crucial: Stage 1 enables long context, Stage 2 prevents short-context regression — a problem that YaRN and NTK methods also face but do not explicitly address.
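
One plausible way to use the outcome of both stages at inference, assumed here for illustration rather than taken from the paper, is to keep both λ vectors and choose between them by sequence length. The names `select_rescale_factors` and `rope_angles`, and all the numeric values, are hypothetical.

```python
import numpy as np

def select_rescale_factors(seq_len, original_max_len, short_factors, long_factors):
    """Use the gentler factors inside the original window, the aggressive ones beyond it."""
    chosen = long_factors if seq_len > original_max_len else short_factors
    return np.asarray(chosen, dtype=float)

def rope_angles(positions, theta, factors):
    """Rotation angles m * theta_i / lambda_i, shape (num_positions, d/2)."""
    return np.outer(np.asarray(positions, dtype=float), theta / factors)

# Made-up values for a head with 4 dimension pairs, for illustration only.
theta = 10000.0 ** (-2.0 * np.arange(4) / 8)
short_factors = [1.0, 1.0, 1.1, 1.2]
long_factors = [1.0, 2.0, 16.0, 64.0]

factors = select_rescale_factors(seq_len=100_000, original_max_len=4096,
                                 short_factors=short_factors,
                                 long_factors=long_factors)
angles = rope_angles(np.arange(100_000), theta, factors)
```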

Non-Uniform Optimal Rescaling

A key empirical finding from LongRoPE: the optimal λᵢ values are highly non-uniform across dimensions. Some dimensions benefit from aggressive rescaling (large λᵢ), others benefit from almost none (λᵢ ≈ 1).

This explains why the three-zone approximation of YaRN works only up to moderate scales: at extreme lengths, the true optimum is complex enough that a three-zone formula is too coarse.

The searched λ vector typically shows:

  • An irregular profile that no simple closed-form function of i captures
  • Some high-frequency dimensions needing no rescaling
  • Some mid-frequency dimensions needing more rescaling than YaRN assigns
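
One way to build intuition for this non-uniformity, borrowed from the reasoning behind YaRN's zones rather than from the search itself, is to compare each pair's wavelength 2π/θᵢ with the original training length. Pairs whose wavelength fits inside the training window have already been seen through full rotations, while the rest have only been seen over part of a rotation. The sketch below assumes a head dimension of 128, base 10000, and a 4096-token training window; the function names are illustrative.

```python
import numpy as np

def rope_wavelengths(head_dim, base=10000.0):
    """Number of positions needed for one full rotation of each pair: 2*pi / theta_i."""
    i = np.arange(head_dim // 2)
    theta = base ** (-2.0 * i / head_dim)
    return 2.0 * np.pi / theta

def full_rotation_mask(head_dim, train_len, base=10000.0):
    """True where a pair completes at least one full rotation within train_len positions."""
    return rope_wavelengths(head_dim, base) <= train_len

mask = full_rotation_mask(head_dim=128, train_len=4096)
print(f"{mask.sum()} of {mask.size} pairs complete a full rotation within 4096 positions")
```

This split gives only a coarse prior. The searched factors described above refine it dimension by dimension.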

Results

LongRoPE was evaluated on LLaMA 2-7B extended to various lengths:

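| Method | 4k | 8k | 32k | 128k | 256k | 512k |
|---|---|---|---|---|---|---|
| YaRN | | | | | | |
| LongRoPE | | | | | | |

(The per-length values for each method are missing from the source table.)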

LongRoPE maintains near-original perplexity at 512k tokens. At 2M tokens (the headline result), perplexity increases but the model remains functional for tasks like document retrieval.

Comparison of Context Extension Methods

| Method | Max practical extension | Per-dim tuning | Fine-tuning | Short-context recovery |
|---|---|---|---|---|
| Linear interp | ~8× | No (uniform) | ~1000 steps | Partial |
| NTK scaling | ~4× | No (formula) | Optional | No |
| YaRN | ~32× | Approximate (3 zones) | ~400 steps | No |
| LongRoPE | ~512× (4k → 2M) | Yes (searched) | ~600 steps | Yes (Stage 2) |

Where LongRoPE Is Used

  • Phi-3 (Microsoft): Phi-3-mini and Phi-3-small use LongRoPE for 128k context
  • Phi-3.5-MoE: Also uses LongRoPE
  • The technique is increasingly adopted in models targeting very long contexts

Summary

LongRoPE’s key contributions:

  1. Per-dimension rescaling: each frequency gets its own λᵢ instead of a global formula
  2. Evolutionary search: finds optimal λ without gradient updates, using only perplexity as signal
  3. Two-stage fine-tuning: Stage 1 for long context, Stage 2 for short-context recovery
  4. Extreme extension: enables 2M-token context, well beyond what NTK or YaRN can handle

LongRoPE pushes context extension toward lengths that can fit entire books, codebases, or hours of transcribed audio in a single context window.