T5: Every NLP Task as Text-to-Text

Published: January 12, 2024

TL;DR: T5 (Raffel et al., Google, 2019) recasts every NLP task as text in, text out. Translation, summarisation, classification and QA are all handled by one model trained with a single cross-entropy loss. The model is an encoder-decoder Transformer with relative position biases.

The Unifying Idea

Different NLP tasks used to need different architectures and training objectives: BERT with a classification head for sentiment, a seq2seq model for translation, and yet another model for QA.

T5 asks: what if we just described the task in natural language as part of the input?

translate English to German: That is good. → Das ist gut.

summarize: Scientists at NASA discovered... → NASA finds new water on Mars.

sentiment: The movie was terrible. → negative

cola sentence: She run fast. → not acceptable
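The input/target pairs above can be produced by a small formatting step. The following is a minimal sketch, not the preprocessing code used for T5 itself: `TASK_PREFIXES` and `to_text_to_text` are hypothetical names, and the prefix strings are copied from the examples in this post, so they may differ from the exact strings a released checkpoint was trained with.

```python
# Hypothetical helper: prefixes are taken from the examples in this post,
# not from any released T5 preprocessing code.
TASK_PREFIXES = {
    "translate_en_de": "translate English to German: ",
    "summarize": "summarize: ",
    "sentiment": "sentiment: ",
    "cola": "cola sentence: ",
}


def to_text_to_text(task: str, source: str, target: str) -> tuple[str, str]:
    """Return an (input_text, target_text) pair for one labelled example."""
    if task not in TASK_PREFIXES:
        raise KeyError(f"unknown task: {task}")
    return TASK_PREFIXES[task] + source, target


examples = [
    to_text_to_text("translate_en_de", "That is good.", "Das ist gut."),
    to_text_to_text("sentiment", "The movie was terrible.", "negative"),
    to_text_to_text("cola", "She run fast.", "not acceptable"),
]

for input_text, target_text in examples:
    print(f"{input_text!r} -> {target_text!r}")
```

Note that the class label is just another string: nothing in the pair distinguishes a classification example from a translation example.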

Every task gets a text prefix that tells the model what to do. The model is trained with teacher-forcing on the target text. At inference, it generates the answer token-by-token.
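To make the difference between training and inference concrete, here is a toy sketch in plain Python. Everything in it is an assumption made for illustration: the token ids, the `TOY_TABLE` stand-in for a model (its next-token distribution depends only on the last decoder input token), and the helper names. The only T5-specific detail is that the pad id doubles as the decoder start token.

```python
import math

PAD_ID = 0  # assumed decoder start token (T5 reuses the pad id for this)
EOS_ID = 1  # assumed end-of-sequence id


def shift_right(target_ids, start_id=PAD_ID):
    """Teacher forcing: the decoder sees the gold target shifted by one step."""
    return [start_id] + list(target_ids[:-1])


def cross_entropy(step_probs, target_ids):
    """Mean negative log-likelihood of the gold token at each step."""
    losses = [-math.log(probs[gold]) for probs, gold in zip(step_probs, target_ids)]
    return sum(losses) / len(losses)


def greedy_decode(next_token_probs, max_len=10):
    """Inference: feed the model's own predictions back in, one token at a time."""
    generated = []
    decoder_input = [PAD_ID]
    for _ in range(max_len):
        probs = next_token_probs(decoder_input)
        token = max(probs, key=probs.get)
        generated.append(token)
        if token == EOS_ID:
            break
        decoder_input.append(token)
    return generated


# Toy stand-in for a trained model: token id -> next-token distribution.
TOY_TABLE = {
    PAD_ID: {5: 0.5, 6: 0.3, EOS_ID: 0.2},
    5: {6: 0.5, EOS_ID: 0.3, 5: 0.2},
    6: {EOS_ID: 1.0},
}


def toy_model(decoder_input):
    return TOY_TABLE[decoder_input[-1]]


target_ids = [5, 6, EOS_ID]
decoder_input = shift_right(target_ids)  # [0, 5, 6]
step_probs = [toy_model(decoder_input[: i + 1]) for i in range(len(decoder_input))]
loss = cross_entropy(step_probs, target_ids)
decoded = greedy_decode(toy_model)

print("decoder input:", decoder_input)
print("training loss:", loss)
print("greedy output:", decoded)
```

During training the decoder input comes from the gold target, so all steps can be scored in one pass. During generation each step has to wait for the previous prediction.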

Architecture: Full Encoder-Decoder

T5 closely follows the original Transformer’s full encoder-decoder design:

Encoder: reads the input (prefix + content), builds rich contextual representations.
Decoder: generates the output token-by-token, attending to both its own previous outputs (causal self-attention) and the encoder representations (cross-attention).

This is different from BERT (encoder only) and GPT (decoder only).

Figure 1: T5's encoder-decoder architecture. The encoder reads the full input bidirectionally; the decoder generates output token-by-token, attending to its own past tokens AND the encoder's output via cross-attention.
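The three attention patterns described here can be written down as 0/1 masks, where a 1 at row `q`, column `k` means position `q` may attend to position `k`. This is an illustrative sketch with hypothetical function names. It ignores padding and says nothing about how any particular library stores its masks.

```python
def encoder_self_attention_mask(src_len):
    """Bidirectional: every input position sees every input position."""
    return [[1] * src_len for _ in range(src_len)]


def decoder_self_attention_mask(tgt_len):
    """Causal: output position q sees only positions k <= q."""
    return [[1 if k <= q else 0 for k in range(tgt_len)] for q in range(tgt_len)]


def cross_attention_mask(tgt_len, src_len):
    """Every output position sees the full encoder output."""
    return [[1] * src_len for _ in range(tgt_len)]


src_len, tgt_len = 4, 3
for name, mask in [
    ("encoder self-attention", encoder_self_attention_mask(src_len)),
    ("decoder self-attention", decoder_self_attention_mask(tgt_len)),
    ("cross-attention", cross_attention_mask(tgt_len, src_len)),
]:
    print(name)
    for row in mask:
        print("  ", row)
```

An encoder-only model such as BERT uses only the first pattern, and a decoder-only model such as GPT uses only the second. The encoder-decoder combines all three.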

Pre-Training: Span Corruption

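T5 doesn’t use BERT-style masked LM, where individual tokens are masked and predicted in place. It uses span corruption: randomly select spans of consecutive tokens (about 15% of all tokens, with a mean span length of 3 in the paper's final setup), replace each span with a single sentinel token, and train the decoder to generate the dropped spans, each introduced by its sentinel.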

Original:  "The cat sat on the mat."
Corrupted: "The cat <extra_id_0> the <extra_id_1>."
Target:    "<extra_id_0> sat on <extra_id_1> mat <extra_id_2>"

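Because each span collapses to a single sentinel, both the corrupted input and the target are shorter than with per-token masking, which makes pre-training cheaper. In the paper's comparisons it also performed on par with or slightly better than the other denoising objectives tried.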
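The transformation can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not T5's actual preprocessing: tokens are whole words rather than SentencePiece pieces, the spans are passed in explicitly instead of being sampled at random, and `span_corrupt` is a hypothetical name. Following the paper's convention, the target ends with one extra closing sentinel.

```python
def span_corrupt(tokens, spans):
    """Apply span corruption.

    spans: sorted, non-overlapping (start, end) index pairs, end exclusive.
    Returns (corrupted_input_tokens, target_tokens).
    """
    corrupted, target = [], []
    cursor = 0
    for k, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{k}>"
        corrupted.extend(tokens[cursor:start])  # keep text before the span
        corrupted.append(sentinel)              # span collapses to one sentinel
        target.append(sentinel)                 # target names the span...
        target.extend(tokens[start:end])        # ...then spells out its content
        cursor = end
    corrupted.extend(tokens[cursor:])           # keep text after the last span
    target.append(f"<extra_id_{len(spans)}>")   # closing sentinel
    return corrupted, target


tokens = ["The", "cat", "sat", "on", "the", "mat", "."]
corrupted, target = span_corrupt(tokens, [(2, 4), (5, 6)])

print("Corrupted:", " ".join(corrupted))
print("Target:   ", " ".join(target))
```

Three of the seven tokens were dropped, yet the input shrinks by only one token and the target has six. With longer documents and the same corruption rate, the target stays a small fraction of the input length.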

Scale: T5-Small to T5-11B

Model	Params
T5-small	60M
T5-base	220M
T5-large	770M
T5-3B (called XL in later releases)	3B
T5-11B (called XXL in later releases)	11B
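As a rough guide to what these sizes mean in practice, the arithmetic below estimates the memory needed just to hold the weights. It is a back-of-the-envelope sketch: the parameter counts are the rounded figures from the table, the bytes-per-parameter values are the standard sizes of each dtype, and activations, optimizer state and framework overhead are ignored. `weight_gb` is a hypothetical helper.

```python
# Rounded parameter counts from the table above.
PARAMS = {
    "T5-small": 60e6,
    "T5-base": 220e6,
    "T5-large": 770e6,
    "T5-3B": 3e9,
    "T5-11B": 11e9,
}

BYTES_PER_PARAM = {"fp32": 4, "bf16": 2}


def weight_gb(n_params, dtype):
    """Memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


for name, n_params in PARAMS.items():
    fp32 = weight_gb(n_params, "fp32")
    bf16 = weight_gb(n_params, "bf16")
    print(f"{name:<9} fp32: {fp32:6.2f} GB   bf16: {bf16:6.2f} GB")
```

By this estimate the smallest model fits in about a quarter of a gigabyte at full precision, while the largest needs roughly 44 GB in fp32 before any computation happens.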

Flan-T5 (2022) is T5 further fine-tuned on 1,836 language tasks phrased as instructions, which makes it a strong, openly available instruction-following model.

✅ Key Takeaways

T5 unifies all NLP tasks as text-in, text-out using task prefixes in natural language.
Uses full encoder-decoder architecture with cross-attention connecting the two halves.
Pre-trained via span corruption rather than token masking — more efficient.
Flan-T5 adds instruction fine-tuning, making it a strong open-source baseline for reasoning tasks.


Alessio Borgi
