Encoder vs Decoder vs Encoder-Decoder Transformers

TL;DR: Encoder-only models (BERT) read everything bidirectionally, which makes them great for understanding. Decoder-only models (GPT) generate left-to-right, which makes them great for generation. Encoder-decoder models (T5) encode the input fully, then generate the output, which suits transformation tasks like translation and summarisation.

The Three Families

The original 2017 Transformer ("Attention Is All You Need") was an encoder-decoder. The field then diverged into three distinct families, each optimised for different tasks.

1. Encoder-Only: BERT-style

Input → [Encoder Block × N] → Contextual representations

Each encoder block contains:

  • Bidirectional self-attention (every token sees every other token)
  • Feed-forward network
  • Layer norm + residual connections

Training objective: Masked Language Modelling (MLM). A random subset of input tokens (15% in BERT) is hidden, mostly by replacing them with [MASK], and the model predicts the originals. Because the targets are removed from the input, every position can attend in both directions without seeing its own answer.
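To make the objective concrete, here is a minimal sketch of how one MLM training example could be built. It is deliberately simplified: every selected token becomes [MASK], whereas BERT's actual recipe sometimes substitutes a random token or leaves the selected token unchanged. The function name `mask_tokens` and its parameters are illustrative, not taken from any library.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (masked_input, labels) for a simplified MLM example.

    labels[i] holds the original token where position i was masked and
    None elsewhere, so a loss would be computed only on masked positions.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)   # hide the token from the model
            labels.append(tok)    # ...and remember it as the target
        else:
            masked.append(tok)
            labels.append(None)   # not a prediction target
    return masked, labels

tokens = "the cat sat on the mat".split()
masked, labels = mask_tokens(tokens, mask_prob=0.3, seed=1)
for m, l in zip(masked, labels):
    print(f"{m:8s} target={l}")
```

The model sees the whole `masked` sequence at once, which is why bidirectional attention is safe here: the only tokens it has to predict are the ones that are no longer in the input.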

What this is good at:

  • Sentence classification (spam detection, sentiment)
  • Token classification (NER, POS tagging)
  • Question answering (span extraction)
  • Sentence embeddings (semantic search)

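What this cannot do: autoregressive generation. The encoder is trained to predict a token from context on both sides, but when generating token N+1 the right-hand context does not exist yet, so the model is asked for a left-to-right prediction it was never trained to make.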

Examples: BERT, RoBERTa, DeBERTa, ALBERT, ModernBERT.

2. Decoder-Only: GPT-style

Input → [Decoder Block × N] → Next-token probabilities

Each decoder block contains:

  • Causal (masked) self-attention (each token sees only past tokens)
  • Feed-forward network
  • Layer norm + residual connections

Note: there is no cross-attention in decoder-only models. Each decoder block has only two sub-layers (not three), because there is no encoder output to attend to.

Training objective: Next-token prediction. Given tokens 1…N, predict token N+1. The causal mask ensures no peeking.
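As a rough illustration of what "given tokens 1…N, predict token N+1" means for the training data, the sketch below turns one sequence into prediction examples. The helper names and the `<bos>` start token are assumptions made for this example.

```python
def next_token_examples(tokens):
    """Pair every prefix tokens[:i] with the token that follows it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

def shifted_inputs_and_targets(tokens):
    """The same examples in the compact form usually used for training:
    targets are the inputs shifted left by one position. The causal mask
    is what stops position i from looking at anything after position i."""
    return tokens[:-1], tokens[1:]

tokens = ["<bos>", "the", "cat", "sat"]
for context, target in next_token_examples(tokens):
    print(context, "->", target)

inputs, targets = shifted_inputs_and_targets(tokens)
print(inputs, targets)
```

Every position in the sequence yields a training example, so a single pass over a text trains the model on all of its prefixes at once.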

What this is good at:

  • Text generation (stories, code, completions)
  • In-context learning (few-shot prompting)
  • Instruction following (with RLHF/fine-tuning)
  • Anything you can frame as completion

What this is less natural for: tasks that require reading the full input before producing an output (e.g., translation, summarisation), though modern large decoder-only models handle these with prompting anyway.
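One way to picture "handling it with prompting" is to write the transformation task as text whose natural continuation is the answer. The sketch below builds such a few-shot prompt; the `Input:`/`Output:` layout and the function name are hypothetical choices for illustration, not a format any particular model requires.

```python
def few_shot_prompt(task, examples, query):
    """Build a completion-style prompt: an instruction, worked examples,
    then an unfinished final example for the model to complete."""
    lines = [task, ""]
    for source, target in examples:
        lines.append(f"Input: {source}")
        lines.append(f"Output: {target}")
        lines.append("")
    lines.append(f"Input: {query}")
    lines.append("Output:")  # the model continues from here
    return "\n".join(lines)

prompt = few_shot_prompt(
    "Translate English to French.",
    [("cheese", "fromage"), ("good morning", "bonjour")],
    "thank you",
)
print(prompt)
```

The full source sentence sits in the prompt, so by the time the model starts generating it has already read the whole input, just through causal attention rather than a separate encoder.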

Examples: GPT-2, GPT-3, GPT-4, LLaMA, Mistral, Gemma, Claude.

Why did decoder-only win? Scale. Decoder-only models are simpler to scale (one attention type, no encoder-decoder interaction), and next-token prediction is a self-supervised objective that needs no labels and yields a training signal from every token of any text. As scale increased, emergent capabilities made them competitive on understanding tasks too.

3. Encoder-Decoder: T5-style

Input  → [Encoder Block × N] → Latent
                                  ↓
Prompt → [Decoder Block × M] → Generated output

The encoder processes the full input bidirectionally. The decoder generates the output token by token, using:

  • Causal self-attention (on its own generated tokens so far)
  • Cross-attention (queries the encoder's output at each step)
  • Feed-forward network
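The cross-attention step in the list above can be sketched in a few lines of plain Python. This is a toy version under stated assumptions: queries, keys and values are the raw vectors, whereas a real layer applies learned projections and uses several heads. All names here are illustrative.

```python
import math

def softmax(xs):
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross_attention(decoder_states, encoder_outputs):
    """Each decoder position queries all encoder positions.

    Returns (outputs, weights) with one row per decoder position.
    """
    dim = len(encoder_outputs[0])
    scale = math.sqrt(dim)
    outputs, weights = [], []
    for q in decoder_states:
        scores = [dot(q, k) / scale for k in encoder_outputs]
        w = softmax(scores)  # no mask: every encoder position is visible
        mixed = [sum(wi * v[d] for wi, v in zip(w, encoder_outputs))
                 for d in range(dim)]
        outputs.append(mixed)
        weights.append(w)
    return outputs, weights

encoder_outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 source positions
decoder_states = [[2.0, 0.0], [0.0, 2.0]]               # 2 target positions so far
outputs, weights = cross_attention(decoder_states, encoder_outputs)
for row in weights:
    print([round(w, 3) for w in row])
```

The weight matrix has one row per decoder position and one column per encoder position, so its shape depends on both sequence lengths. That is the key structural difference from self-attention, where the matrix is always square.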

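Training objective: Span corruption (T5), denoising of corrupted text (BART), or similar sequence-to-sequence objectives. In span corruption, contiguous spans of the input are replaced with sentinel tokens and the decoder learns to emit the missing spans.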
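Here is a small sketch of what a span-corruption example looks like. The spans are chosen by hand to keep the output deterministic; in real pre-training they are sampled at random. The `<extra_id_N>` sentinel naming follows the convention used by common T5 tokenizers, and the function itself is a hypothetical helper.

```python
def span_corrupt(tokens, spans):
    """Replace each (start, end) span with a sentinel and build the target.

    spans must be sorted and non-overlapping; end is exclusive.
    """
    corrupted, target = [], []
    cursor = 0
    for i, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{i}>"
        corrupted.extend(tokens[cursor:start])  # keep text before the span
        corrupted.append(sentinel)              # one sentinel replaces the span
        target.append(sentinel)                 # target names the sentinel...
        target.extend(tokens[start:end])        # ...then spells out the span
        cursor = end
    corrupted.extend(tokens[cursor:])
    target.append(f"<extra_id_{len(spans)}>")   # closing sentinel
    return corrupted, target

tokens = "Thank you for inviting me to your party last week .".split()
corrupted, target = span_corrupt(tokens, spans=[(2, 4), (8, 9)])
print(" ".join(corrupted))
print(" ".join(target))
```

The encoder reads the corrupted sequence bidirectionally and the decoder generates the target autoregressively, so pre-training already has the same input → output shape as the downstream tasks.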

What this is good at:

  • Machine translation (full input available, output generated)
  • Summarisation (read document, write summary)
  • Question answering with generation (read context, write answer)
  • Any task naturally framed as input → output transformation

Examples: T5, BART, mT5, Flan-T5, NLLB (translation).

Side-by-Side Comparison

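| Property            | Encoder-only                        | Decoder-only            | Encoder-Decoder               |
|---------------------|-------------------------------------|-------------------------|-------------------------------|
| Self-attention type | Bidirectional                       | Causal                  | Both                          |
| Cross-attention     | None                                | None                    | Decoder → Encoder             |
| Reads input         | Fully, in parallel                  | Causally, left to right | Fully (encoder)               |
| Generates output    | No (one vector per input token)     | Autoregressively        | Autoregressively              |
| Training objective  | MLM (plus NSP in the original BERT) | Next-token prediction   | Seq2seq (e.g. span corruption) |
| Good for            | Understanding                       | Generation              | Transformation                |
| Examples            | BERT, DeBERTa                       | GPT, LLaMA, Claude      | T5, BART, Flan-T5             |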

The Attention Mask Differences

Encoder self-attention (bidirectional):
✓ ✓ ✓ ✓
✓ ✓ ✓ ✓
✓ ✓ ✓ ✓
✓ ✓ ✓ ✓

Decoder self-attention (causal):
✓ ✗ ✗ ✗
✓ ✓ ✗ ✗
✓ ✓ ✓ ✗
✓ ✓ ✓ ✓

Decoder cross-attention (full encoder access):
✓ ✓ ✓ ✓   ← decoder pos 1 attends to all encoder positions
✓ ✓ ✓ ✓   ← decoder pos 2 attends to all encoder positions

The mask tells the whole story. Encoder: open. Decoder: lower triangular. Cross-attention: open to the encoder.
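The three mask shapes are easy to build and apply by hand. The following is a minimal pure-Python sketch (real implementations use tensor libraries, and the function names here are made up for illustration). A `True` entry means "may attend", matching the ticks in the diagrams above.

```python
import math

def bidirectional_mask(n):
    """Encoder self-attention: every position sees every position."""
    return [[True] * n for _ in range(n)]

def causal_mask(n):
    """Decoder self-attention: position i sees positions 0..i only."""
    return [[j <= i for j in range(n)] for i in range(n)]

def cross_mask(n_dec, n_enc):
    """Cross-attention: every decoder position sees all encoder positions."""
    return [[True] * n_enc for _ in range(n_dec)]

def masked_softmax(scores, mask):
    """Row-wise softmax in which masked-out entries get zero weight."""
    out = []
    for s_row, m_row in zip(scores, mask):
        masked = [s if keep else -math.inf for s, keep in zip(s_row, m_row)]
        peak = max(masked)
        exps = [math.exp(s - peak) for s in masked]  # exp(-inf) == 0.0
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def show(mask):
    return "\n".join(" ".join("1" if keep else "0" for keep in row) for row in mask)

print(show(causal_mask(4)))

# With equal scores, each row spreads its weight evenly over visible positions.
weights = masked_softmax([[0.0] * 4 for _ in range(4)], causal_mask(4))
print(weights[1])  # [0.5, 0.5, 0.0, 0.0]
```

Setting masked scores to negative infinity before the softmax is the usual trick: those positions contribute exactly zero weight, and the remaining weights still sum to one.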

Summary

The three architectures are not better or worse in absolute terms; they are optimised for different settings:

  • Understanding a fixed input? → Encoder-only
  • Generating open-ended text? → Decoder-only
  • Transforming one sequence into another? → Encoder-decoder

Modern LLMs (GPT-4, Claude, LLaMA) are decoder-only, using scale and prompting to cover all three use cases. But for specialised tasks with a clear input-output structure and limited compute, encoder-decoder models remain competitive.