GAPE: Remember to Forget — Gated Adaptive Positional Encoding


TL;DR: RoPE breaks when sequences extend beyond the training window — rotary phases go out-of-distribution, causing spurious long-range alignments and attention diffusion. GAPE adds a content-aware logit bias with two learned gates (query-gate contracts irrelevant context; key-gate protects important distant tokens) without touching the rotary geometry. Drop-in, no fine-tuning needed, provably sharper attention.
Paper: "Remember to Forget: Gated Adaptive Positional Encoding"  ·  arXiv:2605.10414
Authors: R. Ali, A. Borgi, C. Irwin, M. Severino, P. Liò
Venue: arXiv preprint, 2026

The RoPE Long-Context Problem

Rotary Position Embedding (RoPE) is the positional scheme used in almost every modern LLM, including LLaMA, Mistral, Gemma, and Qwen. It encodes position by rotating query and key vectors in frequency-specific planes, so the dot-product between a query at position m and a key at position n depends on position only through the relative distance m−n.
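
The relative-distance property can be checked numerically. The sketch below is a minimal pure-Python version that treats each pair of dimensions as one complex number; the function names `rope` and `score`, the adjacent-pair layout, and the base of 10000 are assumptions chosen for illustration, not details taken from the paper.

```python
import cmath
import random

def rope(vec, pos, base=10000.0):
    """Rotate consecutive pairs of `vec` by position-dependent angles.

    Each pair (vec[2i], vec[2i+1]) is treated as one complex number and
    multiplied by exp(1j * pos * theta_i), with theta_i = base ** (-2i / d).
    """
    d = len(vec)
    out = []
    for i in range(d // 2):
        theta = base ** (-2.0 * i / d)
        z = complex(vec[2 * i], vec[2 * i + 1])
        out.append(z * cmath.exp(1j * pos * theta))
    return out

def score(q_rot, k_rot):
    """Real dot product of two rotated vectors given in complex-pair form."""
    return sum((a * b.conjugate()).real for a, b in zip(q_rot, k_rot))

random.seed(0)
q = [random.gauss(0, 1) for _ in range(8)]
k = [random.gauss(0, 1) for _ in range(8)]

# Same relative distance (m - n = 7) at two different absolute offsets.
s_near = score(rope(q, 10), rope(k, 3))
s_far = score(rope(q, 1010), rope(k, 1003))
print(abs(s_near - s_far) < 1e-9)
```

Shifting both positions by the same offset leaves the score unchanged up to floating-point error, which is the property the paragraph above describes.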

This works beautifully within the training range. But when you extend context beyond what the model saw during training:

  • Rotary phases at large relative distances enter out-of-distribution regimes — the model has never seen those angular configurations.
  • Attention becomes diffuse: scores spread across irrelevant distant tokens rather than concentrating on relevant ones.
  • Spurious long-range alignments emerge: distant tokens with “accidentally” matching OOD rotary phases receive high attention.
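
One way to see why some phases go out-of-distribution is to compare each rotary plane's wavelength with the training window. The sketch below assumes the standard frequency schedule theta_i = base^(-2i/d) with base 10000, a head dimension of 128, and a 4096-token training window; these values are illustrative and not taken from the paper.

```python
import math

def rope_wavelengths(head_dim, base=10000.0):
    """Wavelength, in tokens, of each rotary plane: 2*pi / theta_i."""
    return [2 * math.pi * base ** (2.0 * i / head_dim)
            for i in range(head_dim // 2)]

train_len = 4096   # assumed training window
head_dim = 128     # assumed head dimension

waves = rope_wavelengths(head_dim)
# Planes whose wavelength exceeds the training window never complete a
# full rotation during training, so longer distances reach unseen angles.
unseen = [i for i, w in enumerate(waves) if w > train_len]
print(f"{len(unseen)} of {len(waves)} planes have wavelength > {train_len}")
```

Under these assumptions, the slow planes are the ones whose angles at extended distances were never observed during training.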

Existing fixes (RoPE scaling, YaRN, LongRoPE) mostly rescale frequencies to handle longer ranges, but they trade local positional resolution for global stability. None of them targets the content mismatch between relevant and irrelevant distant tokens.
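
The resolution trade-off is easiest to see with linear position interpolation, the simplest form of RoPE scaling. The sketch below is an illustration under assumed values (head dimension 128, training window 4096, scale factor 4); YaRN and LongRoPE use more refined per-frequency schemes that are not reproduced here.

```python
import math

def rotary_angle(pos, plane, head_dim, base=10000.0, scale=1.0):
    """Rotation angle for `pos` in rotary plane `plane`.

    scale > 1 applies linear position interpolation: positions are
    divided by `scale` so a longer context maps into the trained range.
    """
    theta = base ** (-2.0 * plane / head_dim)
    return (pos / scale) * theta

head_dim, train_len, scale = 128, 4096, 4.0  # assumed values

# Largest angle in the slowest plane at 4x context, with scaling applied,
# compared with the largest angle seen during training.
slow = head_dim // 2 - 1
max_trained = rotary_angle(train_len, slow, head_dim)
max_scaled = rotary_angle(scale * train_len, slow, head_dim, scale=scale)

# Angle separating two adjacent tokens in the fastest plane.
step_plain = rotary_angle(1, 0, head_dim)
step_scaled = rotary_angle(1, 0, head_dim, scale=scale)

print(math.isclose(max_scaled, max_trained))  # global range stays in the trained range
print(step_plain / step_scaled)               # adjacent-token angle shrinks by `scale`
```

The scaled angles stay inside the trained range, but neighbouring tokens become harder to tell apart because their angular separation shrinks by the same factor.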

GAPE: Two Gates on the Logits

GAPE introduces a content-aware additive bias directly into the pre-softmax attention logits, after the rotary dot-product is computed:

\[a_{mn} = \frac{q_m^\top k_n}{\sqrt{d}} + \underbrace{g_q(q_m) \cdot g_k(k_n)}_{\text{GAPE bias}}\]

The two gates are:

  • Query gate g_q(q_m): a scalar function of the query vector. Learns to output a negative value for queries that are “looking for something specific” — this contracts the attention mass assigned to distant, unprotected tokens.
  • Key gate g_k(k_n): a scalar function of the key vector. Learns to output a positive value for keys that carry salient content — this protects important distant tokens from being suppressed.

The decoupling is critical: the query gate controls forgetting (global distance-based suppression), while the key gate controls remembering (token-specific survival). The rotary geometry is untouched.

Figure 1 — GAPE adds a factored logit bias after the rotary dot-product. The query gate (left path) suppresses irrelevant long-range context; the key gate (right path) preserves salient distant tokens. The rotary geometry remains unchanged.
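
The biased logits can be sketched in a few lines. The code follows the formula exactly as written in this post; modelling each gate as a linear scalar projection (`w_q`, `w_k`) is an assumption made for illustration, and the paper's gate parametrization, sign conventions, and any distance dependence may differ. Causal masking is omitted for brevity.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(row):
    peak = max(row)                       # subtract the max for stability
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def gape_logits(Q, K, w_q, w_k):
    """Pre-softmax logits a_mn = q_m.k_n / sqrt(d) + g_q(q_m) * g_k(k_n).

    Q and K are lists of already-rotated query/key vectors. The gates are
    linear scalar projections here, which is an assumption of this sketch.
    """
    d = len(Q[0])
    g_q = [dot(w_q, q) for q in Q]        # one scalar per query
    g_k = [dot(w_k, k) for k in K]        # one scalar per key
    return [[dot(q, k) / math.sqrt(d) + gq * gk
             for k, gk in zip(K, g_k)]
            for q, gq in zip(Q, g_q)]

random.seed(0)
d, n = 4, 6
Q = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
K = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
w_q = [random.gauss(0, 1) for _ in range(d)]
w_k = [random.gauss(0, 1) for _ in range(d)]

attn = [softmax(row) for row in gape_logits(Q, K, w_q, w_k)]
# With zero gate weights the bias vanishes and plain attention is recovered.
plain = gape_logits(Q, K, [0.0] * d, [0.0] * d)
```

Because the bias is added after the rotary dot-product, the rotated queries and keys are consumed unchanged, and setting the gates to zero recovers standard scaled dot-product attention.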

Theoretical Guarantee

The paper proves that protected tokens (high g_k value) remain accessible regardless of distance — their effective attention logit is boosted by the key gate, counteracting any rotary-induced suppression. Conversely, for unprotected tokens, the attention mass decays as a function of the query gate value, giving a formal “forgetting” property for irrelevant context.
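
A toy calculation shows why an additive logit bias can keep one token accessible as the context grows. This is only an illustration of the general softmax behaviour, not the paper's theorem: all rotary logits are set to zero, and the bias values (+2 for the needle, −8 for the distractors) are hypothetical.

```python
import math

def needle_mass(n_distractors, needle_bias=0.0, distractor_bias=0.0):
    """Attention mass on one needle among equal-scoring distractors.

    All rotary logits are set to 0 so only the additive biases matter.
    """
    needle = math.exp(needle_bias)
    rest = n_distractors * math.exp(distractor_bias)
    return needle / (needle + rest)

for n in (1_000, 4_000, 16_000):
    plain = needle_mass(n)
    gated = needle_mass(n, needle_bias=2.0, distractor_bias=-8.0)
    print(n, round(plain, 5), round(gated, 3))
```

Without a bias, the needle's share shrinks as 1/(n+1). With the biases, its odds against the distractors are multiplied by exp(needle_bias − distractor_bias), so it retains a much larger share at every length.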

Empirical Validation

NIAH: Needle-in-a-Haystack Retrieval

The Needle-in-a-Haystack (NIAH) benchmark places a critical fact (the “needle”) at various positions in a long context and asks the model to retrieve it. Across all tested context lengths and needle positions, including 4× the training context length, GAPE concentrates attention more sharply on the needle token.

Figure 2 — NIAH retrieval scores at 1×, 2×, and 4× training context. GAPE (blue) maintains high recall at all context extensions; the RoPE baseline (orange) degrades significantly at 2× and collapses at 4×.
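
For readers unfamiliar with the setup, a NIAH prompt can be built roughly as follows. The helper name `build_niah_prompt`, the filler text, and the needle are all hypothetical; the paper's exact prompt construction is not described in this post.

```python
def build_niah_prompt(filler, needle, question, n_sentences, depth):
    """Insert `needle` at relative `depth` (0.0 = start, 1.0 = end) of a
    haystack made of `n_sentences` copies of `filler`."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be in [0, 1]")
    haystack = [filler] * n_sentences
    haystack.insert(round(depth * n_sentences), needle)
    return " ".join(haystack) + "\n\n" + question

# Hypothetical texts, chosen only for illustration.
prompt = build_niah_prompt(
    filler="The sky was grey that morning.",
    needle="The access code is 7421.",
    question="What is the access code?",
    n_sentences=200,
    depth=0.25,
)
```

Sweeping `depth` and `n_sentences` gives the grid of needle positions and context lengths that a NIAH evaluation reports.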

Attention Sharpness

The key gate’s mechanistic effect is visible directly in the attention maps: GAPE produces sharper, more focused attention patterns than the vanilla RoPE baseline.

Figure 3 — Attention maps on the NIAH task. With GAPE (right), attention concentrates tightly on the needle token; without GAPE (left), attention diffuses across the haystack at long ranges.
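
Sharpness can also be quantified rather than read off a heat map. One common choice is the Shannon entropy of each attention row; whether the paper uses this metric is not stated in this post, so the sketch below is an assumption, and the two example distributions are synthetic.

```python
import math

def attention_entropy(weights):
    """Shannon entropy (in nats) of one attention row; lower means sharper."""
    return -sum(w * math.log(w) for w in weights if w > 0.0)

n = 1024
diffuse = [1.0 / n] * n                        # mass spread over every token
focused = [0.9] + [0.1 / (n - 1)] * (n - 1)    # most mass on one token

print(attention_entropy(diffuse))   # equals log(n), the maximum for n tokens
print(attention_entropy(focused))
```

A diffuse row sits at the maximum entropy log(n), while a row concentrated on a single token scores far lower.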

OOD Perplexity

Figure 4 — Perplexity as context length increases beyond the training window. GAPE (blue) shows slower perplexity growth compared to the RoPE baseline, confirming improved out-of-distribution robustness for language modelling.
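
As a reminder of what the curve measures, perplexity is the exponential of the mean negative log-likelihood per token. The sketch below uses hypothetical per-token log-probabilities; in practice they would come from the model being evaluated at each context length.

```python
import math

def perplexity(token_logprobs):
    """exp of the mean negative log-likelihood (natural log) per token."""
    if not token_logprobs:
        raise ValueError("need at least one token")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Hypothetical per-token log-probabilities from a language model.
logprobs = [math.log(0.5), math.log(0.25), math.log(0.125)]
print(perplexity(logprobs))
```

Lower values mean the model assigns higher probability to the observed tokens, so slower growth with context length indicates better robustness beyond the training window.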

✅ Key Takeaways

  • GAPE adds a factored content-aware logit bias — query-gate × key-gate — that decouples "forgetting irrelevant context" from "protecting salient distant tokens".
  • The rotary geometry of RoPE is completely preserved; GAPE is a drop-in augmentation requiring no architectural changes.
  • Formal guarantee: protected tokens (high key-gate) remain accessible; unprotected distant tokens' attention mass decays with the query gate.
  • Empirical gains on NIAH retrieval at 1×, 2×, and 4× training context, plus slower perplexity growth beyond the training window.