GAPE: Remember to Forget — Gated Adaptive Positional Encoding
The RoPE Long-Context Problem
Rotary Positional Encoding (RoPE) is the positional scheme used in almost every modern LLM — LLaMA, Mistral, Gemma, Qwen. It encodes position by rotating query and key vectors in frequency-specific planes, so the dot-product between a query at position m and a key at position n depends only on their relative distance m−n.
This works beautifully within the training range. But when you extend context beyond what the model saw during training:
- Rotary phases at large relative distances enter out-of-distribution regimes — the model has never seen those angular configurations.
- Attention becomes diffuse: scores spread across irrelevant distant tokens rather than concentrating on relevant ones.
- Spurious long-range alignments emerge: distant tokens with “accidentally” matching OOD rotary phases receive high attention.
Existing fixes (RoPE scaling, YaRN, LongRoPE) mostly rescale frequencies to handle longer ranges, but they trade local positional resolution for global stability. None of them targets the content mismatch between relevant and irrelevant distant tokens.
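The relative-distance property that RoPE relies on can be checked numerically. The following is a minimal NumPy sketch, not code from the paper: the helper name `rope_rotate`, the base of 10000, and the even/odd pairing of dimensions are assumptions chosen for illustration.

```python
import numpy as np

def rope_rotate(x, pos, base=10000.0):
    """Rotate an even-length vector x as if it sat at integer position pos.

    Hypothetical helper: pairs dimensions (0,1), (2,3), ... and rotates
    each pair by pos * theta_i, with theta_i = base ** (-2i / d).
    """
    d = x.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)   # one frequency per 2-D plane
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x_even * cos - x_odd * sin
    out[1::2] = x_even * sin + x_odd * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.standard_normal(8), rng.standard_normal(8)

# Same relative distance (m - n = 3) at two different absolute offsets.
score_a = rope_rotate(q, 5) @ rope_rotate(k, 2)
score_b = rope_rotate(q, 105) @ rope_rotate(k, 102)
print(np.isclose(score_a, score_b))
```

The two scores agree because each 2-D rotation composes into a single rotation by the position difference. Nothing in this check depends on whether those differences were seen during training, which is where the out-of-distribution problem enters.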
GAPE: Two Gates on the Logits
GAPE introduces a content-aware additive bias directly into the pre-softmax attention logits, after the rotary dot-product is computed:
\[a_{mn} = \frac{q_m^\top k_n}{\sqrt{d}} + \underbrace{g_q(q_m) \cdot g_k(k_n)}_{\text{GAPE bias}}\]
The two gates are:
- Query gate g_q(q_m): a scalar function of the query vector. Learns to output a negative value for queries that are “looking for something specific” — this contracts the attention mass assigned to distant, unprotected tokens.
- Key gate g_k(k_n): a scalar function of the key vector. Learns to output a positive value for keys that carry salient content — this protects important distant tokens from being suppressed.
The decoupling is critical: the query gate controls forgetting (global distance-based suppression), while the key gate controls remembering (token-specific survival). The rotary geometry is untouched.
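The logit formula above can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the authors' implementation: the source only says each gate is a scalar function of its vector, so the linear gates (`w_q`, `w_k`), the function name `gape_attention`, and the causal mask are all hypothetical choices.

```python
import numpy as np

def gape_attention(queries, keys, w_q, w_k):
    """Causal attention weights with an additive gate-product bias.

    queries, keys: (T, d) arrays, assumed to be already rotary-encoded.
    w_q, w_k:      (d,) weights of hypothetical linear gates (an assumption;
                   the source does not specify the gate parameterization).
    """
    T, d = queries.shape
    scores = queries @ keys.T / np.sqrt(d)    # rotary dot-product logits
    g_q = queries @ w_q                       # (T,) one scalar per query
    g_k = keys @ w_k                          # (T,) one scalar per key
    bias = np.outer(g_q, g_k)                 # bias[m, n] = g_q(q_m) * g_k(k_n)
    logits = scores + bias

    # Causal mask: position m may only attend to positions n <= m.
    future = np.triu(np.ones((T, T), dtype=bool), k=1)
    logits = np.where(future, -np.inf, logits)

    # Numerically stable softmax over keys.
    logits = logits - logits.max(axis=-1, keepdims=True)
    weights = np.exp(logits)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights, bias

rng = np.random.default_rng(0)
T, d = 6, 16
queries, keys = rng.standard_normal((T, d)), rng.standard_normal((T, d))
w_q, w_k = rng.standard_normal(d) * 0.1, rng.standard_normal(d) * 0.1
weights, bias = gape_attention(queries, keys, w_q, w_k)
```

Because the bias is a rank-one outer product added after the dot-product, the rotation applied to queries and keys is never modified, which matches the claim that the rotary geometry is untouched.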

Theoretical Guarantee
The paper proves that protected tokens (high g_k value) remain accessible regardless of distance — their effective attention logit is boosted by the key gate, counteracting any rotary-induced suppression. Conversely, for unprotected tokens, the attention mass decays as a function of the query gate value, giving a formal “forgetting” property for irrelevant context.
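A short worked step shows how the bias enters the attention distribution. This is not the paper's proof, only a direct consequence of the logit definition above, with \(p_{mn}\) denoting the softmax weight that query m places on key n. For two keys n and n′ attended to by the same query, the normalizer cancels:

\[\frac{p_{mn}}{p_{mn'}} = \exp\!\left(\frac{q_m^\top (k_n - k_{n'})}{\sqrt{d}}\right) \cdot \exp\!\Big(g_q(q_m)\,\big[g_k(k_n) - g_k(k_{n'})\big]\Big)\]

The first factor is what RoPE attention alone would give. The second factor comes entirely from GAPE and depends only on the gap between the two key-gate values, scaled by the query gate, so it can reweight two distant tokens relative to each other without changing their rotary dot-products.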
Empirical Validation
NIAH: Needle-in-a-Haystack Retrieval
The Needle-in-a-Haystack (NIAH) benchmark places a critical fact (the “needle”) at various positions in a long context and asks the model to retrieve it. GAPE consistently places sharper attention on the needle token at all context lengths and needle positions, even at 4× training context length.
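The construction of a single NIAH test case can be sketched as follows. The helper `build_niah_prompt`, the filler text, and the needle sentence are hypothetical and only illustrate the placement idea; the source does not describe the exact prompts it uses.

```python
def build_niah_prompt(filler_sentences, needle, depth, question):
    """Insert `needle` at a fractional `depth` (0.0 = start, 1.0 = end)
    of the filler text and append the retrieval question."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be in [0, 1]")
    idx = round(depth * len(filler_sentences))
    haystack = filler_sentences[:idx] + [needle] + filler_sentences[idx:]
    return " ".join(haystack) + "\n\n" + question

# Illustrative data, not taken from the benchmark.
filler = [f"Filler sentence number {i}." for i in range(10)]
needle = "The secret code is 7421."
prompt = build_niah_prompt(filler, needle, 0.5, "What is the secret code?")
```

Sweeping `depth` and the number of filler sentences yields the grid of needle positions and context lengths over which retrieval is evaluated.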

Attention Sharpness
The key gate’s mechanistic effect is visible directly in the attention maps: GAPE produces sharper, more focused attention patterns compared to the vanilla RoPE baseline.
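One common way to put a number on sharpness is the entropy of each attention row: lower entropy means the mass is concentrated on fewer tokens. The source does not say which metric it uses, so the sketch below is an assumption, and `attention_entropy` is a hypothetical helper.

```python
import numpy as np

def attention_entropy(weights, eps=1e-12):
    """Shannon entropy (in nats) of each attention distribution
    along the last axis; eps guards against log(0)."""
    weights = np.asarray(weights, dtype=float)
    return -(weights * np.log(weights + eps)).sum(axis=-1)

# Two toy attention rows over 8 tokens.
uniform = np.full(8, 1 / 8)                   # fully diffuse
peaked = np.array([0.93] + [0.01] * 7)        # concentrated on one token

print(attention_entropy(uniform), attention_entropy(peaked))
```

The uniform row reaches the maximum possible entropy for 8 tokens, log 8, while the peaked row scores much lower, so comparing per-row entropies between two models gives a simple scalar summary of what the attention maps show visually.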

OOD Perplexity

✅ Key Takeaways
- GAPE adds a factored content-aware logit bias — query-gate × key-gate — that decouples "forgetting irrelevant context" from "protecting salient distant tokens".
- The rotary geometry of RoPE is completely preserved; GAPE is a drop-in augmentation requiring no architectural changes.
- Formal guarantee: protected tokens (high key-gate) remain accessible; unprotected distant tokens' attention mass decays with the query gate.
- Empirical gains on NIAH retrieval and long-context benchmarks at 1×, 2×, and 4× training context.
