YaRN: Yet Another RoPE Extension Method

TL;DR: YaRN (Peng et al., 2023) splits the RoPE dimensions into three groups by frequency: low-frequency dims get linear interpolation, high-frequency dims are left unmodified, and mid-frequency dims get a blend of the two (the "NTK-by-parts" scheme). It then applies an attention temperature correction. The result is significantly better long-context performance than linear interpolation or NTK scaling alone, with only ~400 fine-tuning steps needed.

The Problem YaRN Solves

Both linear interpolation and NTK scaling are global: each applies a single rule across all RoPE frequency dimensions (linear interpolation divides every frequency by the same factor, NTK scaling changes the shared base). But different dimensions encode different kinds of positional information:

  • High-frequency dims (small wavelength): encode fine-grained local position. They should not be interpolated — compressing their cycles destroys local structure.
  • Low-frequency dims (large wavelength): encode long-range position. They can be linearly interpolated without harm.
  • Mid-frequency dims: need something in between.

YaRN handles each group differently.
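To make the spread of wavelengths concrete, here is a minimal sketch that computes λᵢ for every RoPE dimension pair. It assumes the common RoPE convention θᵢ = base^(−2i/d) with base = 10000 and a head dimension of 128; these values and the helper name `rope_wavelengths` are illustrative assumptions, not something fixed by YaRN itself.

```python
import math

def rope_wavelengths(head_dim: int, base: float = 10000.0) -> list[float]:
    """Wavelength (in tokens) of each of the head_dim/2 RoPE frequency pairs.

    Assumes the usual RoPE parameterization theta_i = base ** (-2i / head_dim).
    """
    wavelengths = []
    for i in range(head_dim // 2):
        theta = base ** (-2.0 * i / head_dim)      # radians rotated per token
        wavelengths.append(2.0 * math.pi / theta)  # tokens per full cycle
    return wavelengths

wl = rope_wavelengths(head_dim=128)
print(f"fastest dim: {wl[0]:.2f} tokens per cycle")
print(f"slowest dim: {wl[-1]:.0f} tokens per cycle")
```

Under these assumptions the fastest dimension completes a cycle roughly every 2π tokens, while the slowest never completes a full cycle inside a 4096-token training window, which is why a single global rule fits neither end well.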

The Three Zones

YaRN divides the d/2 frequency dimensions into three groups based on their wavelength λᵢ = 2π/θᵢ relative to the training length L. The target length L′ enters only through the scale factor s = L′/L:

Low frequency:  λᵢ > L/α          → linear interpolation
High frequency: λᵢ < L/β          → no change (extrapolation)
Mid frequency:  L/β ≤ λᵢ ≤ L/α    → blend of the two (ramp)

Default hyperparameters: α = 1, β = 32 (tuned empirically for LLaMA-family models). The ramp function interpolates linearly between the two strategies across the mid-frequency range.
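As a rough illustration of how the dimensions split, the sketch below labels each one by the number of full cycles it completes within the training length, L/λᵢ, following the YaRN paper's convention: fewer than α cycles counts as low frequency, more than β as high frequency. The helper `classify_zone`, the head dimension of 128, the base of 10000, and L = 4096 are assumptions chosen for the example.

```python
import math

def classify_zone(wavelength: float, train_len: int,
                  alpha: float = 1.0, beta: float = 32.0) -> str:
    """Label one RoPE dimension by how many cycles it completes in train_len."""
    rotations = train_len / wavelength
    if rotations < alpha:
        return "low"    # less than alpha cycles seen in training: interpolate
    if rotations > beta:
        return "high"   # more than beta cycles: leave the frequency untouched
    return "mid"        # in between: blended by the ramp

head_dim, base, train_len = 128, 10000.0, 4096  # illustrative values
counts = {"high": 0, "mid": 0, "low": 0}
for i in range(head_dim // 2):
    theta = base ** (-2.0 * i / head_dim)
    counts[classify_zone(2.0 * math.pi / theta, train_len)] += 1
print(counts)
```

Note that the boundaries depend only on the training length; the target length affects how strongly the low-frequency dims are interpolated, not which zone a dimension falls into.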

The Ramp Function

For each dimension i, YaRN defines a blending factor r(i):

r(i) = 0   if λᵢ < L/β (high-frequency, no change)
r(i) = 1   if λᵢ > L/α (low-frequency, full interpolation)
r(i) = 1 − (L/λᵢ − α) / (β − α)   otherwise (linear ramp)

The effective frequency for dimension i becomes:

θᵢ_new = (1 − r(i)) · θᵢ + r(i) · (θᵢ / s)

Here s = L′/L is the scale factor. When r(i) = 0, θᵢ is unchanged (high frequency); when r(i) = 1, it becomes θᵢ / s (full interpolation); in between, it is a blend of the two.

This gives each dimension group the treatment it needs, rather than applying a single global rule.
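The blend can be sketched in a few lines. This is a minimal illustration, not a reference implementation: it uses this post's convention that r = 1 means full interpolation, takes the ramp to be linear in L/λᵢ between α and β as in the YaRN paper, and assumes θᵢ = base^(−2i/d) with base = 10000. The function names are hypothetical.

```python
import math

def ramp(wavelength: float, train_len: int,
         alpha: float = 1.0, beta: float = 32.0) -> float:
    """Blending factor r: 0 keeps the frequency, 1 fully interpolates it."""
    rotations = train_len / wavelength
    keep = (rotations - alpha) / (beta - alpha)
    keep = min(1.0, max(0.0, keep))   # clamp to [0, 1]
    return 1.0 - keep

def yarn_frequencies(head_dim: int, train_len: int, scale: float,
                     base: float = 10000.0) -> list[float]:
    """Per-dimension frequencies after the three-zone blend."""
    new_freqs = []
    for i in range(head_dim // 2):
        theta = base ** (-2.0 * i / head_dim)
        r = ramp(2.0 * math.pi / theta, train_len)
        new_freqs.append((1.0 - r) * theta + r * theta / scale)
    return new_freqs

freqs = yarn_frequencies(head_dim=128, train_len=4096, scale=4.0)
print(f"highest frequency: {freqs[0]:.6f}")   # left untouched
print(f"lowest frequency:  {freqs[-1]:.3e}")  # divided by the scale factor
```

With these settings the highest-frequency dimension keeps its original θ, while the lowest-frequency one ends up at exactly θ / s, matching the two endpoints of the formula above.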

The Attention Temperature Fix

A subtlety that plain interpolation and NTK scaling ignore: extending the context changes the distribution of attention weights. With many more tokens competing in each softmax, attention tends to flatten, and the softmax temperature that worked at the training length becomes miscalibrated.

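YaRN addresses this with a fixed (not learned) attention temperature correction:

Attention(Q, K, V) = softmax( Q Kᵀ / (t · √d_k) ) · V

where √(1/t) = 0.1 · ln(s) + 1 (with s = L′/L). For s = 4 (4× context extension), √(1/t) ≈ 1.139, so t ≈ 0.77 and the logits are multiplied by 1/t ≈ 1.30. In practice this is implemented by scaling both queries and keys by √(1/t), which can be folded into the precomputed RoPE tables at no extra cost.

This sharpens attention logits slightly, keeping the softmax distribution well-calibrated at longer contexts. Without this correction, models tend to “spread” attention too uniformly at long range, a well-known failure mode.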

Why temperature matters: At long context, if attention entropy grows unchecked, the model attends roughly equally to all tokens — losing the ability to focus on relevant information. The temperature correction counteracts this, maintaining sharp attention even over thousands of tokens.
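A small numerical sketch of the effect, following the YaRN paper's parameterization √(1/t) = 0.1 · ln(s) + 1, so that logits are multiplied by 1/t. The toy logits and the helper names are made up for illustration, and returning 1.0 for s ≤ 1 is an assumption borrowed from common implementations rather than something stated in this post.

```python
import math

def yarn_attention_scale(scale: float) -> float:
    """sqrt(1/t) = 0.1 * ln(s) + 1; assumed to be 1.0 when no extension is applied."""
    if scale <= 1.0:
        return 1.0
    return 0.1 * math.log(scale) + 1.0

def softmax(logits: list[float]) -> list[float]:
    peak = max(logits)                       # subtract max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs: list[float]) -> float:
    return -sum(p * math.log(p) for p in probs if p > 0.0)

m = yarn_attention_scale(4.0)   # factor applied to both q and k
inv_t = m * m                   # resulting multiplier on the q.k logits
logits = [2.0, 1.0, 0.5, 0.0]   # toy attention logits
plain = softmax(logits)
sharpened = softmax([x * inv_t for x in logits])
print(f"sqrt(1/t) = {m:.4f}, 1/t = {inv_t:.4f}")
print(f"entropy without correction: {entropy(plain):.4f}")
print(f"entropy with correction:    {entropy(sharpened):.4f}")
```

The corrected distribution has lower entropy than the uncorrected one, which is the "sharper attention" the correction is meant to restore.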

Training Recipe

YaRN requires minimal fine-tuning:

  1. Modify RoPE with the three-zone frequency scheme
  2. Apply attention temperature correction
  3. Fine-tune for ~400 steps on long-context data (compared to thousands for full context extension training)

This makes YaRN practical: you can take an existing model (e.g., LLaMA-2 7B trained at 4096 tokens) and extend it to 128k context with a short fine-tuning run.
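Steps 1 and 2 of the recipe can be combined before any fine-tuning by baking both into the rotary cos/sin tables. The sketch below is one way to do that under stated assumptions: θᵢ = base^(−2i/d), a linear ramp in L/λᵢ, and the factor 0.1 · ln(s) + 1 applied to both cos and sin (so q·k is multiplied by its square). The function `yarn_rotary_tables` and the tiny sizes are hypothetical, chosen to keep the example fast.

```python
import math

def yarn_rotary_tables(head_dim: int, train_len: int, target_len: int,
                       base: float = 10000.0,
                       alpha: float = 1.0, beta: float = 32.0):
    """Return (cos, sin) tables, each of shape [target_len][head_dim // 2]."""
    s = target_len / train_len
    m = 0.1 * math.log(s) + 1.0 if s > 1.0 else 1.0   # temperature factor
    freqs = []
    for i in range(head_dim // 2):
        theta = base ** (-2.0 * i / head_dim)
        rotations = train_len * theta / (2.0 * math.pi)  # equals L / wavelength
        keep = min(1.0, max(0.0, (rotations - alpha) / (beta - alpha)))
        r = 1.0 - keep                                   # 1 = full interpolation
        freqs.append((1.0 - r) * theta + r * theta / s)
    cos = [[m * math.cos(p * f) for f in freqs] for p in range(target_len)]
    sin = [[m * math.sin(p * f) for f in freqs] for p in range(target_len)]
    return cos, sin

# Toy sizes: extend a 16-token "training" window to 64 tokens (s = 4).
cos, sin = yarn_rotary_tables(head_dim=8, train_len=16, target_len=64)
print(len(cos), len(cos[0]))
```

Because the tables are computed once and reused, the modification adds no per-token cost at inference time; step 3, the short fine-tuning run, then adapts the weights to the new tables.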

Results vs Other Methods

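| Method               | 2k→8k quality | 2k→32k quality | Fine-tuning steps        |
|----------------------|---------------|----------------|--------------------------|
| Linear interpolation | Good          | Degrades       | ~1000                    |
| NTK scaling          | Good          | Moderate       | 0 (but better with some) |
| YaRN                 | Best          | Best           | ~400                     |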

In the paper's evaluations (long-sequence perplexity on Proof-pile and GovReport, plus passkey retrieval), YaRN outperforms both methods at the same scale factor, with less fine-tuning than linear interpolation.

Models Using YaRN

  • Yarn-Mistral-7B (Nous Research fine-tunes of Mistral 7B, extending context from 8k to 64k and 128k)
  • Qwen2 series (various context lengths)
  • LLaMA-2 fine-tuned variants (community-produced 32k/64k/128k models)

YaRN is one of the most widely used methods for context extension in the open-source community.

Comparison of Context Extension Methods

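| Method        | High-freq | Low-freq | Temperature | Fine-tune   | Quality  |
|---------------|-----------|----------|-------------|-------------|----------|
| Linear interp | Broken    | Good     | No          | ~1000 steps | Moderate |
| NTK scaling   | Good      | Good     | No          | 0           | Good     |
| NTK (dynamic) | Good      | Good     | No          | 0           | Good     |
| YaRN          | Preserved | Good     | Yes         | ~400        | Best     |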

Summary

YaRN improves on earlier RoPE extension methods by:

  1. Treating different frequency bands differently (local, transitional, long-range)
  2. Correcting attention temperature to maintain focus at long context
  3. Requiring minimal fine-tuning (~400 steps)

It is widely used for extending the context of open-weight models, including Qwen2 and many LLaMA and Mistral derivatives.