T5: Every NLP Task as Text-to-Text

TL;DR: T5 (Raffel et al., Google, 2019) recasts every NLP task as: take text in, produce text out. Translation, summarisation, classification, QA — all trained jointly with a single cross-entropy loss. Encoder-decoder architecture with relative position biases.

The Unifying Idea

Different NLP tasks used to need different architectures and training objectives: BERT with a classification head for sentiment, a seq2seq model for translation, and yet another model for QA.

T5 asks: what if we just described the task in natural language as part of the input?

translate English to German: That is good. → Das ist gut.
summarize: Scientists at NASA discovered... → NASA finds new water on Mars.
sentiment: The movie was terrible. → negative
cola sentence: She run fast. → not acceptable

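Every task gets a text prefix that tells the model what to do. The model is trained with teacher forcing on the target text: at each step the decoder is fed the ground-truth previous tokens and predicts the next one. At inference, it generates the answer token-by-token, feeding back its own predictions.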
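As a rough illustration of the recipe above, the sketch below turns labelled examples into (input text, target text) pairs, builds teacher-forced decoder inputs, and runs a greedy token-by-token loop. The prefix table, the helper names (`to_text_pair`, `shift_right`, `greedy_decode`) and the choice of id 0 as the decoder start token are assumptions made for this example, not T5's actual code; `next_token` stands in for a trained model.

```python
from typing import Callable, Dict, List, Tuple

# Prefixes copied from the examples above; treat the exact strings as
# illustrative, since a given checkpoint may expect different ones.
TASK_PREFIXES: Dict[str, str] = {
    "translate_en_de": "translate English to German: ",
    "summarize": "summarize: ",
    "sentiment": "sentiment: ",
    "cola": "cola sentence: ",
}


def to_text_pair(task: str, source: str, target: str) -> Tuple[str, str]:
    """Turn any labelled example into an (input text, target text) pair."""
    return TASK_PREFIXES[task] + source, target


def shift_right(target_ids: List[int], start_id: int = 0) -> List[int]:
    """Teacher forcing: the decoder input is the gold target shifted one step
    right, so position t sees gold tokens 0..t-1 and must predict token t."""
    return [start_id] + target_ids[:-1]


def greedy_decode(
    next_token: Callable[[List[int]], int],  # hypothetical stand-in for the model
    eos_id: int,
    start_id: int = 0,
    max_len: int = 16,
) -> List[int]:
    """Inference: feed the model's own previous outputs back in, one at a time."""
    out = [start_id]
    for _ in range(max_len):
        tok = next_token(out)
        if tok == eos_id:
            break
        out.append(tok)
    return out[1:]  # drop the start token
```

For instance, `to_text_pair("sentiment", "The movie was terrible.", "negative")` yields the third example above, and `shift_right([5, 6, 7, 1])` gives `[0, 5, 6, 7]`. Classification and generation share the same code path, which is the point of the text-to-text framing.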

Architecture: Full Encoder-Decoder

T5 uses the original Transformer’s full encoder-decoder:

  • Encoder: reads the input (prefix + content), builds rich contextual representations.
  • Decoder: generates the output token-by-token, attending to both its own previous outputs (causal self-attention) and the encoder representations (cross-attention).

This is different from BERT (encoder only) and GPT (decoder only).

[Figure: ENCODER ("translate en to de: That is good."): Bidirectional Self-Attention, Feed-Forward + Add&Norm → Encoder Output (all positions) → cross-attention → DECODER ("Das ist gut.", shifted right): Causal Self-Attention (masked), Cross-Attention (reads encoder), Feed-Forward + Add&Norm → Output token probabilities]
Figure 1: T5's encoder-decoder architecture. The encoder reads the full input bidirectionally; the decoder generates output token-by-token, attending to its own past tokens AND the encoder's output via cross-attention.
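The three attention patterns in the figure differ only in which positions are allowed to see which. A minimal sketch of the corresponding boolean masks follows, in plain Python; the function names are made up for this example, and padding masks are left out for brevity.

```python
from typing import List

Mask = List[List[bool]]  # mask[i][j] is True if query i may attend to key j


def encoder_mask(n_enc: int) -> Mask:
    """Bidirectional self-attention: every input position sees every other."""
    return [[True] * n_enc for _ in range(n_enc)]


def causal_mask(n_dec: int) -> Mask:
    """Causal self-attention: output position i sees only positions 0..i."""
    return [[j <= i for j in range(n_dec)] for i in range(n_dec)]


def cross_mask(n_dec: int, n_enc: int) -> Mask:
    """Cross-attention: every decoder position sees all encoder outputs."""
    return [[True] * n_enc for _ in range(n_dec)]
```

With three decoder positions, `causal_mask(3)` is lower-triangular: the first row allows only position 0, the last row allows all three. An encoder-only model like BERT uses just the first pattern, a decoder-only model like GPT just the second; T5 combines all three.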

Pre-Training: Span Corruption

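T5 doesn’t use BERT-style masked LM, which masks individual tokens. It uses span corruption: randomly select short spans of consecutive tokens (the paper's final setup corrupts about 15% of tokens with a mean span length of 3), replace each span with a single sentinel token, and train the model to output the dropped spans, each introduced by its sentinel and followed by one final sentinel.

Original:  "The cat sat on the mat."
Corrupted: "The cat <extra_id_0> the <extra_id_1>."
Target:    "<extra_id_0> sat on <extra_id_1> mat <extra_id_2>"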

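Because each span collapses to a single sentinel, both the corrupted input and the target are shorter than with per-token masking, which makes training cheaper. In the paper's ablations it performed about as well as, or slightly better than, the BERT-style objective on downstream tasks.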
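To make the mechanics concrete, here is a small sketch that applies span corruption to a token list given spans that have already been chosen. The function name `corrupt_spans` and the whitespace-level tokens are assumptions for illustration; the real pipeline operates on SentencePiece ids and samples the spans randomly. Following the T5 paper's description, the sketch appends a closing sentinel to the target.

```python
from typing import List, Tuple


def corrupt_spans(
    tokens: List[str], spans: List[Tuple[int, int]]
) -> Tuple[List[str], List[str]]:
    """Replace each [start, end) span with a sentinel; return (input, target).

    `spans` must be sorted and non-overlapping.
    """
    inputs: List[str] = []
    targets: List[str] = []
    cursor = 0
    for k, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{k}>"
        inputs.extend(tokens[cursor:start])  # keep the uncorrupted text
        inputs.append(sentinel)              # one sentinel per dropped span
        targets.append(sentinel)             # the same sentinel opens the span
        targets.extend(tokens[start:end])    # followed by the dropped tokens
        cursor = end
    inputs.extend(tokens[cursor:])
    targets.append(f"<extra_id_{len(spans)}>")  # final sentinel closes the target
    return inputs, targets


tokens = ["The", "cat", "sat", "on", "the", "mat", "."]
corrupted, target = corrupt_spans(tokens, [(2, 4), (5, 6)])
print(" ".join(corrupted))  # The cat <extra_id_0> the <extra_id_1> .
print(" ".join(target))     # <extra_id_0> sat on <extra_id_1> mat <extra_id_2>
```

Seven input tokens become six on each side here; with longer documents and a 15% corruption rate the target is far shorter than the input, which is where the compute saving comes from.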

Scale: T5-Small to T5-11B

Model             Params
T5-small          60M
T5-base           220M
T5-large          770M
T5-XL             3B
T5-XXL / T5-11B   11B

Flan-T5 (2022) is T5 further fine-tuned on 1,836 tasks phrased as instructions, which makes it a strong open-source instruction-following model.

✅ Key Takeaways

  • T5 unifies all NLP tasks as text-in, text-out using task prefixes in natural language.
  • Uses full encoder-decoder architecture with cross-attention connecting the two halves.
  • Pre-trained via span corruption rather than token masking — more efficient.
  • Flan-T5 adds instruction fine-tuning, making it a strong open-source baseline for reasoning tasks.