ALiBi: Attention with Linear Biases

TL;DR: ALiBi (Attention with Linear Biases) adds no positional embeddings to token vectors. Instead, it subtracts a fixed linear penalty, proportional to token distance, from each attention score. Zero learned parameters, excellent length extrapolation.

The Simplest PE That Works

Press, Smith, and Lewis (2022) asked: do we even need position vectors? What if we just penalise attending to far-away tokens directly in the attention scores?

The idea: tokens that are far apart should pay a cost for attending to each other. Nearby tokens are cheap to attend to; distant ones are expensive.

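They achieve this with a single fixed number per attention head: a slope m > 0 that sets how quickly the penalty grows with distance. The slope is chosen in advance, not learned.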

The Formula

Attention score(i, j) = q_i · k_j / √d_k − m × |i − j|

That's it. The only change to standard attention is subtracting m × |i − j| from each score before the softmax. No PE vectors are added to embeddings at all.
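As a rough illustration of the formula, here is a minimal NumPy sketch for a single head. The function names `alibi_bias` and `alibi_attention_weights`, the toy shapes, and the slope m = 0.5 are assumptions made for this example, not code from the paper. It uses the symmetric |i − j| form written above and leaves out the causal mask that a language model would also apply.

```python
import numpy as np

def alibi_bias(seq_len, m):
    """Return the (seq_len, seq_len) penalty matrix -m * |i - j|."""
    pos = np.arange(seq_len)
    return -m * np.abs(pos[:, None] - pos[None, :])

def alibi_attention_weights(q, k, m):
    """Softmax attention weights for one head with the ALiBi penalty applied.

    q and k are (seq_len, d_k) arrays of the same length (an assumption of
    this sketch).
    """
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)               # standard scaled dot product
    scores = scores + alibi_bias(q.shape[0], m)   # i.e. subtract m * |i - j|
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))   # toy queries: 4 tokens, d_k = 8
k = rng.normal(size=(4, 8))   # toy keys
w = alibi_attention_weights(q, k, m=0.5)
print(w.round(3))             # each row sums to 1
```

Note that nothing is added to `q` or `k` themselves; position enters only through the bias term.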

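Different attention heads use different slopes m, following a geometric sequence. For 8 heads the paper uses {1/2, 1/4, 1/8, …, 1/256}, that is, 2^(−1) through 2^(−8); for n heads in general, the sequence starts at 2^(−8/n) and uses that same value as its ratio. Some heads focus locally (large m = steep penalty); others look further (small m = gentle penalty).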
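The slope schedule can be sketched in a few lines. This assumes the paper's recipe of a geometric sequence that starts at 2^(−8/n) and uses the same value as its ratio, and it handles only power-of-two head counts; the function name `alibi_slopes` is made up for this example.

```python
def alibi_slopes(n_heads):
    """Geometric slope sequence for n_heads attention heads.

    Sketch only: restricted to power-of-two head counts.
    """
    if n_heads <= 0 or n_heads & (n_heads - 1):
        raise ValueError("this sketch only handles power-of-two head counts")
    ratio = 2.0 ** (-8.0 / n_heads)          # first slope and common ratio
    return [ratio ** (h + 1) for h in range(n_heads)]

slopes = alibi_slopes(8)
print(slopes)   # runs from 1/2 down to 1/256
```

With 16 heads the same function gives a sequence that starts at 2^(−0.5) and still ends at 2^(−8), so more heads means finer steps between the same extremes rather than ever-smaller slopes.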

[Figure 1 diagram. Raw attention scores (3 queries × 4 keys) minus the linear bias m × |i − j|, followed by softmax:

Raw scores          Bias (subtracted)
5.2 4.1 3.3 2.8     0   m   2m  3m
4.5 5.8 4.2 3.1     m   0   m   2m
3.7 4.9 5.5 3.9     2m  m   0   m

The bias grows linearly with distance, but softmax normalises. Even at very long sequences (never seen in training), the bias pattern extrapolates naturally.

Models using ALiBi: BLOOM (176B), MPT-7B, MPT-30B, OpenLLaMA*]
Figure 1: ALiBi subtracts m × |i − j| from each attention score. Darker red = larger penalty for farther distance. The same pattern applies unchanged at any sequence length.
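To make the figure concrete, the sketch below plugs its raw scores into the formula and compares the softmax output with and without the penalty. The slope m = 0.5 is an assumed value picked only for illustration; the figure itself leaves m symbolic.

```python
import numpy as np

# Raw scores from Figure 1: 3 query positions (rows) x 4 key positions (columns).
raw = np.array([
    [5.2, 4.1, 3.3, 2.8],
    [4.5, 5.8, 4.2, 3.1],
    [3.7, 4.9, 5.5, 3.9],
])

m = 0.5  # assumed slope, chosen only for illustration

# Distance |i - j| between each query position i and key position j.
i = np.arange(raw.shape[0])[:, None]
j = np.arange(raw.shape[1])[None, :]
distance = np.abs(i - j)

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)   # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

plain_weights = softmax(raw)                 # standard attention
alibi_weights = softmax(raw - m * distance)  # attention with the linear penalty

print(np.round(plain_weights[0], 3))
print(np.round(alibi_weights[0], 3))
```

For the first query, the penalty shifts weight toward the key at distance 0 and away from the key at distance 3, which is the locality preference the figure depicts.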

Why It Extrapolates

Standard absolute PE is trained on sequences of length at most L. Beyond L, the model meets position indices it has never seen, and performance degrades.

ALiBi never uses position indices at all, only distances. At inference on a 4096-token sequence (when training was on 1024), the model sees distances like 1, 2, 3, … 4095, but all it needs to do is subtract m × distance. The penalty formula works at any distance, so extrapolation is essentially free.
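The contrast can be seen in a small NumPy sketch. The lengths 16 and 64 are scaled-down stand-ins for 1024 and 4096, and `alibi_bias` and `learned_table` are hypothetical names used only here. A learned position table has no row for unseen positions, while the ALiBi penalty is simply recomputed for the longer length and agrees with the training-length penalty wherever the two overlap.

```python
import numpy as np

def alibi_bias(seq_len, m):
    """Penalty matrix -m * |i - j|, computed directly from positions."""
    pos = np.arange(seq_len)
    return -m * np.abs(pos[:, None] - pos[None, :])

train_len, test_len, m = 16, 64, 0.25  # small stand-ins for 1024 and 4096

# A learned absolute-position table only has rows for positions seen in training.
learned_table = np.zeros((train_len, 8))
try:
    learned_table[np.arange(test_len)]   # look up positions 0..test_len-1
    lookup_failed = False
except IndexError:
    lookup_failed = True                 # positions >= train_len do not exist

# ALiBi needs no table: the bias is just a function of distance.
short_bias = alibi_bias(train_len, m)
long_bias = alibi_bias(test_len, m)

same_on_overlap = np.array_equal(long_bias[:train_len, :train_len], short_bias)
largest_penalty = long_bias.min()        # reached at distance test_len - 1

print(lookup_failed, same_on_overlap, largest_penalty)
```

Distances beyond the training length are still new to the model, but they receive the largest penalties, so those tokens contribute comparatively little after the softmax.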

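The paper reports that models trained at 1024 tokens with ALiBi outperform the sinusoidal baseline at inference lengths of 2048 and 4096 tokens. Its headline result is a 1.3B-parameter model trained at 1024 tokens that, evaluated at 2048 tokens, matches the perplexity of a sinusoidal model trained at 2048.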

Trade-Offs

  • Pro: Zero extra parameters, trivially simple.
  • Pro: Excellent out-of-the-box extrapolation.
  • Con: The linear bias is a strong inductive bias: locality is built in. Some tasks (e.g., cross-document retrieval) may prefer more flexible attention patterns.
  • Con: Slightly outperformed by RoPE + YaRN in ultra-long-context regimes.

✅ Key Takeaways

  • ALiBi adds no PE vectors; it just subtracts m × |i − j| from each attention score.
  • Different slopes m per head allow some heads to focus locally, others globally.
  • Extrapolates to sequences longer than those seen in training because it uses only relative distance, never absolute position indices.
  • Used in BLOOM (176B) and MPT; now somewhat superseded by RoPE + YaRN for long-context LLMs.