Encoder vs Decoder vs Encoder-Decoder Transformers
The Three Families
The original 2017 Transformer ("Attention Is All You Need") was an encoder-decoder. The field then diverged into three distinct families, each optimised for different tasks.
1. Encoder-Only: BERT-style
Each encoder block contains:
- Bidirectional self-attention (every token sees every other token)
- Feed-forward network
- Layer norm + residual connections
Training objective: Masked Language Modelling (MLM). A random subset of the input tokens is replaced with [MASK], and the model predicts the original tokens. Because the targets are hidden from the input, the model can attend in both directions without simply copying the answer.
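The masking step can be sketched in a few lines. This is a simplified illustration, not BERT's exact recipe: BERT selects about 15% of tokens and sometimes swaps in a random token or leaves the token unchanged, whereas this sketch always substitutes [MASK]. The function name `mask_tokens` and its parameters are hypothetical.

```python
import random

MASK_TOKEN = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (masked_input, labels) for a toy MLM example.

    labels[i] holds the original token where position i was masked and
    None elsewhere, so a loss would be computed only on masked positions.
    """
    rng = random.Random(seed)
    masked_input, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            masked_input.append(MASK_TOKEN)  # hide the token from the model
            labels.append(token)             # remember what it should predict
        else:
            masked_input.append(token)
            labels.append(None)              # no loss at unmasked positions
    return masked_input, labels

tokens = "the cat sat on the mat because it was tired".split()
masked_input, labels = mask_tokens(tokens, mask_prob=0.3, seed=1)
print(masked_input)
print(labels)
```

Every unmasked token, on either side of a [MASK], stays visible to the model, which is why bidirectional attention is safe here.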
What this is good at:
- Sentence classification (spam detection, sentiment)
- Token classification (NER, POS tagging)
- Question answering (span extraction)
- Sentence embeddings (semantic search)
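What this cannot do well: autoregressive generation. The model is trained to predict a position using context on both sides of it. When generating token N+1 there is nothing to its right yet, so left-to-right decoding puts the model in a situation it was never trained for.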
Examples: BERT, RoBERTa, DeBERTa, ALBERT, ModernBERT.
2. Decoder-Only: GPT-style
Each decoder block contains:
- Causal (masked) self-attention (each token sees only past tokens)
- Feed-forward network
- Layer norm + residual connections
Note: there is no cross-attention in decoder-only models. Each decoder block has only two sub-layers (not three), because there is no encoder output to attend to.
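A minimal NumPy sketch of such a two-sub-layer block follows. It assumes a single attention head, a pre-norm layout, a ReLU feed-forward network, and layer norm without learned parameters. These are simplifications chosen for brevity, not the layout of any particular model, and all names are hypothetical.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalise each position's feature vector (no learned scale or shift)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def softmax(scores):
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    exp = np.exp(scores)
    return exp / exp.sum(axis=-1, keepdims=True)

def causal_self_attention(x, w_q, w_k, w_v):
    seq_len = x.shape[0]
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    # True above the diagonal marks future positions, which are masked out.
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    return softmax(scores) @ v

def decoder_only_block(x, p):
    # Sub-layer 1: causal self-attention plus residual connection.
    x = x + causal_self_attention(layer_norm(x), p["w_q"], p["w_k"], p["w_v"])
    # Sub-layer 2: position-wise feed-forward network plus residual connection.
    hidden = np.maximum(0.0, layer_norm(x) @ p["w_1"])
    return x + hidden @ p["w_2"]

rng = np.random.default_rng(0)
d_model, seq_len = 8, 5
params = {name: rng.normal(size=(d_model, d_model)) * 0.1
          for name in ("w_q", "w_k", "w_v")}
params["w_1"] = rng.normal(size=(d_model, 4 * d_model)) * 0.1
params["w_2"] = rng.normal(size=(4 * d_model, d_model)) * 0.1
x = rng.normal(size=(seq_len, d_model))
out = decoder_only_block(x, params)  # same shape as x
```

Because of the mask, changing the last input position leaves the outputs at all earlier positions unchanged.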
Training objective: Next-token prediction. Given tokens 1…N, predict token N+1. The causal mask ensures no peeking.
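In practice this objective is usually set up by shifting one sequence against itself, so every position becomes a training example. A small sketch, with made-up token ids and a hypothetical helper name:

```python
def next_token_pairs(token_ids):
    """Split one sequence into (inputs, targets) shifted by one position."""
    inputs = token_ids[:-1]   # tokens 1..N-1
    targets = token_ids[1:]   # tokens 2..N, each the "next token" of its input
    return inputs, targets

token_ids = [11, 42, 7, 99, 3]  # hypothetical token ids
inputs, targets = next_token_pairs(token_ids)

# Under a causal mask, position i may only use inputs[: i + 1] to predict targets[i].
for i, target in enumerate(targets):
    visible_context = inputs[: i + 1]
    print(visible_context, "->", target)
```

All of these predictions can be trained in a single forward pass, because the causal mask keeps each position from seeing its own target.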
What this is good at:
- Text generation (stories, code, completions)
- In-context learning (few-shot prompting)
- Instruction following (with RLHF/fine-tuning)
- Anything you can frame as completion
What this is less natural for: tasks that require reading the full input before producing an output (e.g., translation, summarisation), though modern large decoder-only models handle these with prompting anyway.
Examples: GPT-2, GPT-3, GPT-4, LLaMA, Mistral, Gemma, Claude.
3. Encoder-Decoder: T5-style
The encoder processes the full input bidirectionally. The decoder generates the output token by token, using:
- Causal self-attention (on its own generated tokens so far)
- Cross-attention (queries the encoder's output at each step)
- Feed-forward network
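Cross-attention is the one sub-layer unique to this family, so a short sketch may help. It assumes a single head and random weights; the function and variable names are hypothetical. The key point is where the inputs come from: queries from the decoder, keys and values from the encoder output.

```python
import numpy as np

def softmax(scores):
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    exp = np.exp(scores)
    return exp / exp.sum(axis=-1, keepdims=True)

def cross_attention(decoder_states, encoder_states, w_q, w_k, w_v):
    q = decoder_states @ w_q   # (dec_len, d): one query per decoder position
    k = encoder_states @ w_k   # (enc_len, d): keys come from the encoder
    v = encoder_states @ w_v   # (enc_len, d): values come from the encoder
    # No mask: every decoder position may attend to every encoder position.
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # (dec_len, enc_len)
    return weights @ v, weights

rng = np.random.default_rng(0)
d_model = 8
encoder_states = rng.normal(size=(6, d_model))  # 6 input tokens, already encoded
decoder_states = rng.normal(size=(2, d_model))  # 2 output tokens generated so far
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(3))

context, weights = cross_attention(decoder_states, encoder_states, w_q, w_k, w_v)
print(context.shape, weights.shape)
```

Each row of `weights` is a distribution over the encoder positions, which is how the decoder consults the full input at every generation step.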
Training objective: Span corruption (T5) or similar sequence-to-sequence objectives.
What this is good at:
- Machine translation (full input available, output generated)
- Summarisation (read document, write summary)
- Question answering with generation (read context, write answer)
- Any task naturally framed as an input → output transformation
Examples: T5, BART, mT5, Flan-T5, NLLB (translation).
Side-by-Side Comparison
| Property | Encoder-only | Decoder-only | Encoder-Decoder |
|---|---|---|---|
| Self-attention type | Bidirectional | Causal | Bidirectional (encoder), causal (decoder) |
| Cross-attention | None | None | Decoder → encoder |
| Reads input | Fully, in parallel | Under a causal mask (left to right) | Fully (encoder) |
| Generates output | No (one output vector per input token) | Autoregressively | Autoregressively |
| Training objective | MLM (plus NSP in the original BERT) | Next-token prediction | Seq2seq (e.g. span corruption) |
| Good for | Understanding | Generation | Transformation |
| Examples | BERT, DeBERTa | GPT, LLaMA, Claude | T5, BART, Flan-T5 |
The Attention Mask Differences
Encoder self-attention (bidirectional):
✓ ✓ ✓ ✓
✓ ✓ ✓ ✓
✓ ✓ ✓ ✓
✓ ✓ ✓ ✓
Decoder self-attention (causal):
✓ ✗ ✗ ✗
✓ ✓ ✗ ✗
✓ ✓ ✓ ✗
✓ ✓ ✓ ✓
Decoder cross-attention (full encoder access):
✓ ✓ ✓ ✓ ← decoder pos 1 attends to all encoder positions
✓ ✓ ✓ ✓ ← decoder pos 2 attends to all encoder positions
The mask tells the whole story. Encoder: open. Decoder: lower triangular. Cross-attention: open to the encoder.
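The three mask shapes are easy to build as boolean arrays, where True means "may attend". This is a small NumPy sketch with hypothetical helper names; rows are query positions and columns are the positions being attended to.

```python
import numpy as np

def encoder_mask(seq_len):
    """Bidirectional: every position sees every position."""
    return np.ones((seq_len, seq_len), dtype=bool)

def causal_mask(seq_len):
    """Causal: position i sees positions 0..i only (lower triangular)."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def cross_attention_mask(decoder_len, encoder_len):
    """Cross-attention: every decoder position sees all encoder positions."""
    return np.ones((decoder_len, encoder_len), dtype=bool)

def render(mask):
    """Draw a mask as rows of 1 (visible) and 0 (masked)."""
    return "\n".join(" ".join("1" if allowed else "0" for allowed in row)
                     for row in mask)

print(render(causal_mask(4)))
# 1 0 0 0
# 1 1 0 0
# 1 1 1 0
# 1 1 1 1
```

In an attention layer, positions where the mask is False typically have their scores set to a very large negative value before the softmax, so they receive zero weight.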
Summary
The three architectures are not better or worse in absolute terms; they are optimised for different settings:
- Understanding a fixed input? → Encoder-only
- Generating open-ended text? → Decoder-only
- Transforming one sequence into another? → Encoder-decoder
Modern general-purpose LLMs such as LLaMA are decoder-only, as GPT-4 and Claude are generally understood to be, and they rely on scale and prompting to cover all three use cases. But for specialised tasks with a clear input-output structure and limited compute, encoder-decoder models remain competitive.
