Attention Masks: Causal, Padding, and Bidirectional
Why Masks Exist
Raw scaled dot-product attention lets every token attend to every other token. Sometimes that is exactly what you want. But often you need to restrict this:
- Language modelling: token 5 must not see token 6, because that would be cheating during training
- Batching: sentences padded to equal length should not attend to the padding
- Encoder-decoder: the decoder needs restricted attention but the encoder does not
Masks solve all of these. They are applied to the raw attention scores before softmax, typically by adding −∞ to masked positions, which softmax converts to a weight of 0.
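As a minimal sketch of that mechanism, assuming PyTorch and made-up scores for a single query over four keys, adding −∞ to one position drives its softmax weight to zero while the remaining weights renormalize among themselves:

```python
import torch

# Hypothetical raw attention scores for one query over four keys.
scores = torch.tensor([2.0, 1.0, 0.5, 3.0])

# Additive mask: 0 keeps a position, -inf blocks it (here, the last key).
additive_mask = torch.tensor([0.0, 0.0, 0.0, float("-inf")])

weights = torch.softmax(scores + additive_mask, dim=-1)

# For comparison: softmax over only the three unmasked scores.
unmasked = torch.softmax(scores[:3], dim=-1)

print(weights)   # the last entry is 0; the first three match `unmasked`
```

The masked key behaves as if it had been removed from the sequence, which is why masking is done on the scores rather than on the weights after softmax.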
1. Bidirectional Mask (BERT-style)
A bidirectional mask places no restrictions. Every token can attend to every other token, including itself.
The cat sat on the mat
The [ ✓ ✓ ✓ ✓ ✓ ✓ ]
cat [ ✓ ✓ ✓ ✓ ✓ ✓ ]
sat [ ✓ ✓ ✓ ✓ ✓ ✓ ]
on [ ✓ ✓ ✓ ✓ ✓ ✓ ]
the [ ✓ ✓ ✓ ✓ ✓ ✓ ]
mat [ ✓ ✓ ✓ ✓ ✓ ✓ ]
Every cell is open. Each token's representation is built from the entire sequence simultaneously.
Used by: BERT, RoBERTa, DeBERTa, any encoder-only model.
Good for: classification, NER, question answering, and other tasks where you read the whole input before deciding.
Cannot do: autoregressive generation. If token 5 can already see token 6, predicting token 6 from position 5 is trivial and the model learns nothing useful for generation.
2. Causal Mask (GPT-style)
A causal (autoregressive) mask enforces that token i can only attend to tokens at positions ≤ i. The future is blocked.
The cat sat on the mat
The [ ✓ ✗ ✗ ✗ ✗ ✗ ]
cat [ ✓ ✓ ✗ ✗ ✗ ✗ ]
sat [ ✓ ✓ ✓ ✗ ✗ ✗ ]
on [ ✓ ✓ ✓ ✓ ✗ ✗ ]
the [ ✓ ✓ ✓ ✓ ✓ ✗ ]
mat [ ✓ ✓ ✓ ✓ ✓ ✓ ]
The attention matrix is lower-triangular. The diagonal is always visible (self-attention). Everything above the diagonal is set to −∞.
Used by: GPT, GPT-2, GPT-3, GPT-4, LLaMA, Mistral, all decoder-only models.
Good for: language generation. At each step, the model predicts the next token from all previous tokens.
Key property: during training, all positions can be processed in parallel (the mask handles causality). During inference, tokens are generated one at a time.
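One way to check the causality property is to edit a late token and see whether earlier outputs move. The sketch below assumes PyTorch and a deliberately simplified single-head attention with identity projections (the helper `self_attention` and the tensor sizes are illustrative, not taken from any particular model):

```python
import torch

def self_attention(x, causal):
    """Single-head self-attention with identity Q/K/V projections (a simplification)."""
    L, d = x.shape
    scores = x @ x.T / d ** 0.5                      # (L, L) raw scores
    if causal:
        # True above the diagonal marks future positions.
        future = torch.triu(torch.ones(L, L), diagonal=1).bool()
        scores = scores.masked_fill(future, float("-inf"))
    return torch.softmax(scores, dim=-1) @ x         # (L, d) outputs

torch.manual_seed(0)
x = torch.randn(6, 8)          # 6 tokens, 8-dim embeddings (arbitrary sizes)
x_edit = x.clone()
x_edit[5] += 1.0               # change only the last token

# Causal: outputs for positions 0..4 are unaffected by the edit to position 5.
causal_same = torch.allclose(self_attention(x, True)[:5],
                             self_attention(x_edit, True)[:5])

# Bidirectional: the same edit leaks into the earlier positions.
bidir_same = torch.allclose(self_attention(x, False)[:5],
                            self_attention(x_edit, False)[:5])

print(causal_same, bidir_same)
```

All six positions are computed in a single matrix product here, which is the parallel-training case: the mask, not the order of computation, is what keeps each position from seeing the future.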
3. Padding Mask
When batching sequences of different lengths, shorter sequences are padded to the maximum length in the batch. Padding tokens carry no meaningful information and should not influence attention.
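Sentence A: "The cat sat" [PAD] [PAD]
Sentence B: "Go" [PAD] [PAD] [PAD] [PAD]
Both sentences are padded to length 5 here (assume a longer sentence elsewhere in the batch sets that length). The padding mask blocks attention to [PAD] positions, shown for Sentence A: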
The cat sat PAD PAD
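The [ ✓ ✓ ✓ ✗ ✗ ]
cat [ ✓ ✓ ✓ ✗ ✗ ]
sat [ ✓ ✓ ✓ ✗ ✗ ]
PAD [ ✓ ✓ ✓ ✗ ✗ ]
PAD [ ✓ ✓ ✓ ✗ ✗ ]
The [PAD] columns are blocked for every query. The [PAD] rows still compute outputs, but those are discarded downstream (for example, excluded from the loss).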
Padding masks are applied on top of whatever other mask is in use. A GPT model uses a causal mask AND a padding mask simultaneously.
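A rough sketch of that combination, assuming PyTorch, right-padded sequences, and boolean masks where True means "blocked" (the variable names and the random scores are placeholders):

```python
import torch

lengths = torch.tensor([3, 1])      # real token counts: "The cat sat", "Go"
B, L = 2, 5                         # batch of 2, padded to length 5

# Key padding mask: True where the key position is [PAD]. Shape (B, 1, L).
positions = torch.arange(L)
pad = (positions[None, :] >= lengths[:, None])[:, None, :]

# Causal mask: True above the diagonal. Shape (1, L, L).
causal = torch.triu(torch.ones(L, L), diagonal=1).bool()[None, :, :]

# A cell is blocked if either rule blocks it; broadcasting gives (B, L, L).
blocked = pad | causal

scores = torch.randn(B, L, L)       # stand-in for raw attention scores
weights = torch.softmax(scores.masked_fill(blocked, float("-inf")), dim=-1)
```

With right padding, key 0 is a real token in every sequence, so no row ends up fully masked. A row in which every key is blocked would produce NaN after softmax, which is worth guarding against if the padding is on the left.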
4. Combining Masks
In practice, masks are combined: float masks (0 or −∞) are added together, and boolean masks are merged with a logical OR over the blocked positions. A decoder in an encoder-decoder model (like T5) uses:
- Causal mask on its own tokens (cannot look ahead)
- No causal mask on cross-attention to the encoder (can see the full encoded input)
- Padding mask on both (ignores padding in the decoder and encoder sequences)
| Model | Self-attention | Cross-attention |
|---|---|---|
| BERT (encoder) | Bidirectional | n/a |
| GPT (decoder) | Causal | n/a |
| T5 encoder | Bidirectional | n/a |
| T5 decoder | Causal | Full (to encoder) |
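The decoder row of the table can be sketched as two separate masks. This is an illustrative example assuming PyTorch, arbitrary sizes, and True meaning "blocked"; it is not T5's actual implementation:

```python
import torch

B, T_dec, T_enc = 2, 4, 5                     # arbitrary example sizes
enc_lengths = torch.tensor([5, 3])            # real encoder tokens per sequence

# True where the encoder (key) position is padding. Shape (B, 1, T_enc).
enc_pad = (torch.arange(T_enc)[None, :] >= enc_lengths[:, None])[:, None, :]

# Cross-attention: decoder queries against encoder keys. There is no causal
# term, so the only blocked cells are the encoder padding columns.
cross_scores = torch.randn(B, T_dec, T_enc)   # stand-in for raw scores
cross_weights = torch.softmax(
    cross_scores.masked_fill(enc_pad, float("-inf")), dim=-1
)

# Decoder self-attention still uses a causal mask over its own T_dec tokens.
causal = torch.triu(torch.ones(T_dec, T_dec), diagonal=1).bool()
self_scores = torch.randn(B, T_dec, T_dec)    # stand-in for raw scores
self_weights = torch.softmax(
    self_scores.masked_fill(causal, float("-inf")), dim=-1
)
```

Note that the cross-attention mask is rectangular (T_dec by T_enc), while the self-attention mask is square.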
Implementation Detail
In PyTorch, attention masks are typically either boolean tensors (applied with masked_fill) or float tensors of 0 and −∞ (added to the raw scores), in both cases before softmax:
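import torch
L = 6  # sequence length (example value)
scores = torch.randn(L, L)  # stand-in for the raw QK^T / sqrt(d_k) scores
# Causal mask for sequence length L: True above the diagonal marks the future
mask = torch.triu(torch.ones(L, L), diagonal=1).bool()
scores = scores.masked_fill(mask, float('-inf'))
attn_weights = torch.softmax(scores, dim=-1)  # each row sums to 1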
masked_fill replaces masked positions with −∞. After softmax, those positions become exactly 0 and contribute nothing to the weighted sum of values.
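The boolean and float forms are interchangeable. As a small check, assuming PyTorch and random stand-in scores, the sketch below builds the float version of a causal mask from the boolean one and confirms that both routes give identical weights:

```python
import torch

torch.manual_seed(0)
L = 4
scores = torch.randn(L, L)                                    # stand-in for raw scores
bool_mask = torch.triu(torch.ones(L, L), diagonal=1).bool()   # True = blocked

# Float form of the same mask: 0 where allowed, -inf where blocked.
float_mask = torch.zeros(L, L).masked_fill(bool_mask, float("-inf"))

via_fill = torch.softmax(scores.masked_fill(bool_mask, float("-inf")), dim=-1)
via_add = torch.softmax(scores + float_mask, dim=-1)

print(torch.equal(via_fill, via_add))
```

Boolean conventions differ between APIs, so it is worth checking whether True means "blocked" or "allowed" before passing a mask to a library function.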
Summary
| Mask type | Allows | Used for |
|---|---|---|
| Bidirectional | All positions | Encoding, understanding |
| Causal | Past + present only | Language generation |
| Padding | Non-pad positions only | Batch processing |
| Combined | Intersection of rules | Encoder-decoder models |
Attention masks are among the simplest mechanisms in the Transformer, yet they decide whether a model reads its input bidirectionally or generates it left to right. Change the mask, change the model family.
