Attention Masks: Causal, Padding, and Bidirectional

TL;DR: Attention masks control which pairs of tokens can attend to each other. Bidirectional masking (BERT) sees everything. Causal masking (GPT) sees only the past. Padding masks ignore filler tokens. The mask choice determines the model's fundamental capability.

Why Masks Exist

Raw scaled dot-product attention lets every token attend to every other token. Sometimes that is exactly what you want. But often you need to restrict this:

  • Language modelling: token 5 must not see token 6, because that would be cheating during training
  • Batching: sentences padded to equal length should not attend to the padding
  • Encoder-decoder: the decoder needs restricted attention but the encoder does not

Masks solve all of these. They are applied to the raw attention scores before softmax, typically by adding −∞ to masked positions, which softmax converts to 0 weight.

masked_scores = scores + mask    where mask[i,j] = 0 (attend) or −∞ (block)
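
To make the additive form concrete, here is a minimal sketch in plain Python, with no tensor library so the arithmetic stays visible. The `softmax` helper and the score values are illustrative assumptions, not taken from any real model.

```python
import math

NEG_INF = float("-inf")

def softmax(row):
    """Softmax over one row of scores, with the usual max subtraction."""
    m = max(row)                           # assumes at least one open position
    exps = [math.exp(x - m) for x in row]  # math.exp(-inf) is exactly 0.0
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 1.0, 0.5]      # made-up raw scores for a single query
mask = [0.0, 0.0, NEG_INF]    # 0 = attend, -inf = block

masked_scores = [s + m for s, m in zip(scores, mask)]
weights = softmax(masked_scores)
print(weights)  # the third weight is exactly 0.0
```

The blocked position drops out entirely, and the remaining weights are renormalised over the two open positions.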

1. Bidirectional Mask (BERT-style)

A bidirectional mask places no restrictions. Every token can attend to every other token, including itself.

       The  cat  sat  on   the  mat
The  [ ✓    ✓    ✓    ✓    ✓    ✓ ]
cat  [ ✓    ✓    ✓    ✓    ✓    ✓ ]
sat  [ ✓    ✓    ✓    ✓    ✓    ✓ ]
on   [ ✓    ✓    ✓    ✓    ✓    ✓ ]
the  [ ✓    ✓    ✓    ✓    ✓    ✓ ]
mat  [ ✓    ✓    ✓    ✓    ✓    ✓ ]

Every cell is open. Each token's representation is built from the entire sequence simultaneously.

Used by: BERT, RoBERTa, DeBERTa, any encoder-only model.
Good for: classification, NER, question answering (tasks where you read the whole input before deciding).
Cannot do: autoregressive generation (you cannot generate token 6 if token 5 already sees token 6).
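
In additive form, a bidirectional mask is simply all zeros. The sketch below, in plain Python with made-up scores and a hypothetical `softmax` helper, checks that such a mask leaves the scores untouched and that every token ends up with nonzero weight on every position, including later ones.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

tokens = ["The", "cat", "sat", "on", "the", "mat"]
n = len(tokens)

# Bidirectional mask: all zeros, so nothing is blocked.
bidirectional_mask = [[0.0] * n for _ in range(n)]

# Made-up scores; in a real model these come from the query-key products.
scores = [[0.1 * (i + j) for j in range(n)] for i in range(n)]

masked = [[s + m for s, m in zip(s_row, m_row)]
          for s_row, m_row in zip(scores, bidirectional_mask)]
weights = [softmax(row) for row in masked]

# Every token puts nonzero weight on every position, past and future.
all_open = all(w > 0.0 for row in weights for w in row)
print(all_open)  # True
```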

2. Causal Mask (GPT-style)

A causal (autoregressive) mask enforces that token i can only attend to tokens at positions ≤ i. The future is blocked.

       The  cat  sat  on   the  mat
The  [ ✓    ✗    ✗    ✗    ✗    ✗ ]
cat  [ ✓    ✓    ✗    ✗    ✗    ✗ ]
sat  [ ✓    ✓    ✓    ✗    ✗    ✗ ]
on   [ ✓    ✓    ✓    ✓    ✗    ✗ ]
the  [ ✓    ✓    ✓    ✓    ✓    ✗ ]
mat  [ ✓    ✓    ✓    ✓    ✓    ✓ ]

After softmax, the attention matrix is lower-triangular. The diagonal is always visible (each token attends to itself). In the additive mask, everything above the diagonal is −∞.
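
The triangular pattern can be built directly from the rule j ≤ i. This is a minimal plain-Python sketch; `causal_mask` is a hypothetical helper name, and the printout uses `o` for open and `x` for blocked.

```python
NEG_INF = float("-inf")

def causal_mask(n):
    """Additive causal mask: 0.0 where j <= i, -inf where j > i."""
    return [[0.0 if j <= i else NEG_INF for j in range(n)] for i in range(n)]

tokens = ["The", "cat", "sat", "on", "the", "mat"]
mask = causal_mask(len(tokens))

# Print one row per query token: o = open, x = blocked.
for token, row in zip(tokens, mask):
    cells = " ".join("o" if m == 0.0 else "x" for m in row)
    print(f"{token:<4} [ {cells} ]")
```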

Used by: GPT, GPT-2, GPT-3, GPT-4, LLaMA, Mistral, all decoder-only models.
Good for: language generation. At each step, the model predicts the next token from all previous tokens.
Key property: during training, all positions can be processed in parallel (the mask handles causality). During inference, tokens are generated one at a time.

Why causal masking enables parallel training: Without it, you would need N separate forward passes, one per prefix, to get N next-token predictions. With the causal mask, all N predictions happen in one forward pass, because each row of the attention matrix uses only the positions visible to it.
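
One way to convince yourself of this is to change a late token and check that earlier outputs do not move. The sketch below is a toy single-head attention in plain Python. As a simplifying assumption, queries, keys, and values are the inputs themselves with no learned projections, and the input vectors are made up.

```python
import math

NEG_INF = float("-inf")

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def causal_self_attention(x):
    """Toy attention over all positions in one call (q = k = v = x)."""
    n, d = len(x), len(x[0])
    out = []
    for i in range(n):
        # Scaled dot-product score for j <= i, -inf for future positions.
        scores = [
            sum(a * b for a, b in zip(x[i], x[j])) / math.sqrt(d)
            if j <= i else NEG_INF
            for j in range(n)
        ]
        w = softmax(scores)
        out.append([sum(w[j] * x[j][k] for j in range(n)) for k in range(d)])
    return out

x = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0], [1.0, 1.0]]
x_changed = [row[:] for row in x]
x_changed[3] = [-3.0, 2.0]  # alter only the last token

out = causal_self_attention(x)
out_changed = causal_self_attention(x_changed)

print(out[:3] == out_changed[:3])  # True: rows 0..2 never see position 3
print(out[3] == out_changed[3])    # False: the last row does change
```

The loop over `i` is only for readability. A tensor library computes all rows as one matrix product, which is where the parallelism comes from.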

3. Padding Mask

When batching sequences of different lengths, shorter sequences are padded to the maximum length in the batch. Padding tokens carry no meaningful information and should not influence attention.

Sentence A: "The cat sat" [PAD] [PAD]
Sentence B: "Go"          [PAD] [PAD] [PAD] [PAD]

The padding mask blocks attention to [PAD] positions:

       The  cat  sat  PAD  PAD
The  [ ✓    ✓    ✓    ✗    ✗ ]
cat  [ ✓    ✓    ✓    ✗    ✗ ]
sat  [ ✓    ✓    ✓    ✗    ✗ ]
PAD  [ ✗    ✗    ✗    ✗    ✗ ]
PAD  [ ✗    ✗    ✗    ✗    ✗ ]

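Padding masks are applied on top of whatever other mask is in use. A GPT model processing a padded batch uses a causal mask and a padding mask simultaneously. The diagram also blocks the [PAD] rows because their outputs are discarded. Implementations usually mask only the [PAD] columns (the keys), since a row that is entirely −∞ turns into NaN under softmax.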
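
As a rough illustration of stacking the two, the plain-Python sketch below builds a key-side padding mask and a causal mask for Sentence A and adds them. The variable names are illustrative assumptions. Only the [PAD] columns are blocked here, so [PAD] query rows still have open positions.

```python
NEG_INF = float("-inf")

tokens = ["The", "cat", "sat", "[PAD]", "[PAD]"]
n = len(tokens)

# Padding mask over keys (columns): block every [PAD] column.
key_is_pad = [t == "[PAD]" for t in tokens]
padding_mask = [[NEG_INF if key_is_pad[j] else 0.0 for j in range(n)]
                for _ in range(n)]

# Causal mask: block every position j > i.
causal_mask = [[0.0 if j <= i else NEG_INF for j in range(n)]
               for i in range(n)]

# Additive combination: a cell stays open only if both masks allow it,
# because -inf plus 0.0 (or plus -inf) is still -inf.
combined = [[c + p for c, p in zip(c_row, p_row)]
            for c_row, p_row in zip(causal_mask, padding_mask)]

for token, row in zip(tokens, combined):
    print(f"{token:<6}", " ".join("o" if m == 0.0 else "x" for m in row))
```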

4. Combining Masks

In practice, masks are combined additively. A decoder in an encoder-decoder model (like T5) uses:

  • Causal mask on its own tokens (cannot look ahead)
  • No causal restriction on cross-attention to the encoder (can see the full encoded input)
  • Padding mask on both (ignores padding in both sequences)

Model            Self-attention   Cross-attention
BERT (encoder)   Bidirectional    none
GPT (decoder)    Causal           none
T5 encoder       Bidirectional    none
T5 decoder       Causal           Full (to encoder)
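
The two decoder masks differ in shape as well as content. The sketch below is plain Python with made-up source and target tokens and a hypothetical `key_padding_row` helper. It builds a square causal-plus-padding mask for decoder self-attention and a rectangular padding-only mask for cross-attention.

```python
NEG_INF = float("-inf")

encoder_tokens = ["The", "cat", "sat", "[PAD]"]  # padded source (made up)
decoder_tokens = ["Le", "chat", "[PAD]"]         # padded target (made up)

def key_padding_row(tokens):
    """One additive mask row: -inf at [PAD] keys, 0.0 elsewhere."""
    return [NEG_INF if t == "[PAD]" else 0.0 for t in tokens]

n_dec = len(decoder_tokens)

# Decoder self-attention: causal + padding, shape (n_dec, n_dec).
dec_pad = key_padding_row(decoder_tokens)
self_mask = [[(0.0 if j <= i else NEG_INF) + dec_pad[j] for j in range(n_dec)]
             for i in range(n_dec)]

# Cross-attention: no causal term, only encoder padding,
# shape (n_dec, n_enc). Every decoder row sees the same encoder keys.
enc_pad = key_padding_row(encoder_tokens)
cross_mask = [enc_pad[:] for _ in range(n_dec)]

print(len(self_mask), "x", len(self_mask[0]))    # 3 x 3
print(len(cross_mask), "x", len(cross_mask[0]))  # 3 x 4
```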

Implementation Detail

In PyTorch, attention masks are typically either boolean tensors applied with masked_fill or float tensors added to the raw scores, in both cases before softmax:

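import torch
L = 6  # example sequence length
scores = torch.randn(L, L)  # stand-in for the raw query-key scores
# Causal mask for sequence length L: True above the diagonal marks blocked positions
mask = torch.triu(torch.ones(L, L), diagonal=1).bool()
scores = scores.masked_fill(mask, float('-inf'))
attn_weights = torch.softmax(scores, dim=-1)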

The masked_fill call replaces masked positions with −∞. After softmax, those positions become exactly 0, contributing nothing to the weighted value sum.
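
One caveat follows from the same arithmetic: the "exactly 0" result relies on at least one position in the row staying open. In floating point, −∞ − (−∞) is NaN, so a row with no open position has no valid softmax. The plain-Python sketch below uses a hand-written `softmax` with the usual max subtraction. It is an illustration of the arithmetic, not any library's implementation.

```python
import math

NEG_INF = float("-inf")

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]  # -inf - (-inf) gives nan
    total = sum(exps)
    return [e / total for e in exps]

partly_masked = softmax([1.0, NEG_INF, NEG_INF])
fully_masked = softmax([NEG_INF, NEG_INF, NEG_INF])

print(partly_masked)                             # [1.0, 0.0, 0.0]
print(all(math.isnan(w) for w in fully_masked))  # True
```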

Summary

Mask type       Allows                   Used for
Bidirectional   All positions            Encoding, understanding
Causal          Past + present only      Language generation
Padding         Non-pad positions only   Batch processing
Combined        Intersection of rules    Encoder-decoder models

Attention masks are one of the simplest mechanisms in the Transformer, yet they decide whether a model reads in both directions or generates left to right. Change the mask, change the model family.