ALiBi: Attention with Linear Biases
The Simplest PE That Works
Press, Smith, and Lewis (2022) asked: do we even need position vectors? What if we just penalise attending to far-away tokens directly in the attention scores?
The idea: tokens that are far apart should pay a cost for attending to each other. Nearby tokens are cheap to attend to; distant ones are expensive.
They achieve this with a single number per attention head: a slope m that sets how quickly the penalty grows with distance.
The Formula
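In the notation of standard scaled dot-product attention, the weight that query position i places on key position j in a head with slope m can be sketched as follows. The symbols q_i, k_j, and d (the per-head dimension), and the 1/sqrt(d) scaling, are assumptions borrowed from the usual attention formula; the paper's own notation may differ.

```latex
\[
  a_{ij} \;=\; \operatorname{softmax}_{j}\!\left(
      \frac{q_i \cdot k_j}{\sqrt{d}} \;-\; m \,\lvert i - j \rvert
  \right)
\]
```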
That's it. The only change to standard attention is subtracting m × |i − j| from each score before the softmax. No PE vectors are added to embeddings at all.
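Different attention heads use different slopes m, following a geometric sequence. For 8 heads the slopes are {1/2, 1/4, 1/8, …, 1/256}; more generally, for n heads the sequence starts at 2^(-8/n) and uses that same value as its ratio. Some heads focus locally (large m = steep penalty); others look further (small m = gentle penalty).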
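A minimal sketch of the slopes and the resulting bias, in plain Python. The function names (`alibi_slopes`, `alibi_bias`, `softmax`) are hypothetical, and the sketch assumes a power-of-two head count, for which the paper's geometric sequence starts at 2^(-8/n) with the same ratio. It is an illustration under those assumptions, not the authors' implementation.

```python
import math


def alibi_slopes(num_heads):
    """Geometric slopes, one per head (sketch: power-of-two head counts only)."""
    if num_heads < 1 or num_heads & (num_heads - 1):
        raise ValueError("this sketch only handles power-of-two head counts")
    ratio = 2.0 ** (-8.0 / num_heads)  # first slope and common ratio
    return [ratio ** (k + 1) for k in range(num_heads)]


def alibi_bias(seq_len, slope):
    """Bias B[i][j] = -slope * |i - j|, added to the scores before the softmax."""
    return [[-slope * abs(i - j) for j in range(seq_len)] for i in range(seq_len)]


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    top = max(row)
    exps = [math.exp(x - top) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


slopes = alibi_slopes(8)

# With every raw score set to zero, the attention weights depend on the bias alone.
# Look at the last query position of a 6-token sequence for two different heads.
steep_head = softmax(alibi_bias(6, slopes[0])[-1])    # largest slope
gentle_head = softmax(alibi_bias(6, slopes[-1])[-1])  # smallest slope
```

In this toy setting the steep head concentrates most of its weight on the nearest tokens, while the gentle head spreads its weight almost uniformly, which is the local-versus-global split described above.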
Why It Extrapolates
A model with standard absolute PE is trained on sequences of length up to L. Beyond L, it has never seen those position indices, and performance degrades.
ALiBi never uses position indices at all, only distances. At inference on a 4096-token sequence (when training was on 1024), the model sees distances like 1, 2, 3, …, 4095, but all it needs to do is subtract m × distance. The penalty formula works at any distance, so extrapolation is essentially free.
The paper reports that models trained at 1024 tokens with ALiBi outperform sinusoidal and learned PE baselines even at inference lengths of 2048 and 4096 tokens.
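The length-independence of the penalty can be checked with a small sketch. The helper names (`alibi_penalty`, `bias_row`) and the slope value are hypothetical, and the bias row is assumed to be causal (each query sees only itself and earlier keys).

```python
def alibi_penalty(i, j, slope):
    """Penalty subtracted from the score of query i attending to key j."""
    return slope * abs(i - j)


def bias_row(query_pos, slope):
    """Causal bias row for one query: keys 0..query_pos."""
    return [-alibi_penalty(query_pos, j, slope) for j in range(query_pos + 1)]


slope = 0.25  # hypothetical slope for a single head
train_len, test_len = 1024, 4096

short_row = bias_row(train_len - 1, slope)  # last query at training length
long_row = bias_row(test_len - 1, slope)    # last query at inference length

# The nearest `train_len` keys receive exactly the same bias at both lengths.
same_local_window = long_row[-train_len:] == short_row

# Keys further away than anything seen in training just get a larger penalty.
farthest_penalty = alibi_penalty(test_len - 1, 0, slope)
```

Because the bias depends only on i − j, the most recent 1024 keys look the same to the model at either length; the additional, more distant keys are simply penalised more heavily.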
Trade-Offs
- Pro: Zero extra parameters, trivially simple.
- Pro: Excellent out-of-the-box extrapolation.
- Con: The linear bias is a strong inductive bias: locality is built in. Some tasks (e.g., cross-document retrieval) may prefer more flexible attention patterns.
- Con: Slightly outperformed by RoPE + YaRN in ultra-long-context regimes.
Key Takeaways
- ALiBi adds no PE vectors; it just subtracts m × |i − j| from each attention score.
- Different slopes m per head allow some heads to focus locally, others globally.
- Extrapolates to longer sequences than training because it only uses relative distance, never absolute position indices.
- Used in BLOOM (176B) and MPT; now somewhat superseded by RoPE + YaRN for long-context LLMs.
