GPT: Autoregressive Language Modelling at Scale
One Task, Massive Scale
GPT (OpenAI, 2018) is conceptually simpler than BERT: predict the next token from all the tokens before it. This is the causal language modelling objective.
Input: "The cat sat on the"
Target: predict "mat"
Repeat for every position in every document. With enough data and scale, the model learns grammar, facts, reasoning, and style purely as a side-effect of becoming good at this one task.
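As a minimal sketch of this objective, the snippet below builds the (context, target) pairs for the example sentence and averages the next-token cross-entropy over them. The whitespace tokenisation, the helper names `next_token_pairs` and `cross_entropy`, and the uniform stand-in "model" are illustrative assumptions, not part of GPT itself.

```python
import math

def next_token_pairs(tokens):
    """Return (context, target) pairs: every prefix predicts the token after it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

def cross_entropy(probs, target):
    """Negative log-probability assigned to the correct next token."""
    return -math.log(probs[target])

# Assumption: whitespace-separated words stand in for real subword tokens.
tokens = "The cat sat on the mat".split()
pairs = next_token_pairs(tokens)

# Stand-in "model": the same uniform distribution over the vocabulary
# for every context. A trained model would condition on the context.
vocab = sorted(set(tokens))
uniform = {word: 1.0 / len(vocab) for word in vocab}

# Training minimises this average over every position in every document.
loss = sum(cross_entropy(uniform, target) for _, target in pairs) / len(pairs)

print(pairs[-1])       # the last pair is the example from the text
print(round(loss, 4))  # equals log(len(vocab)) for a uniform model
```

A model that has learned anything about the text assigns more probability to the true next token than the uniform baseline does, which pushes this loss below `log(len(vocab))`.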
The Architecture: Decoder-Only
GPT uses only the decoder half of the original Transformer. The key difference from BERT: causal (or masked) self-attention.
In causal attention, each token can attend only to itself and to earlier tokens. This is enforced by a mask applied to the attention scores before softmax: every entry above the diagonal (a future position) is set to −∞, so it receives exactly zero weight after softmax.
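The masking step can be sketched in a few lines of plain Python. This is a toy illustration, assuming a 3-token sequence with all raw scores equal. The names `causal_mask` and `masked_softmax` are hypothetical helpers, and a real implementation would operate on batched tensors.

```python
import math

def causal_mask(n):
    """mask[i][j] is True where query i must not see key j (that is, j > i)."""
    return [[j > i for j in range(n)] for i in range(n)]

def masked_softmax(scores, mask):
    """Row-wise softmax with masked entries set to -inf before normalising."""
    weights = []
    for row, mask_row in zip(scores, mask):
        masked = [-math.inf if m else s for s, m in zip(row, mask_row)]
        peak = max(masked)  # the diagonal is never masked, so this is finite
        exps = [math.exp(s - peak) for s in masked]  # exp(-inf) is 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Toy scores: all equal, so each row is uniform over its visible positions.
scores = [[0.0] * 3 for _ in range(3)]
attn = masked_softmax(scores, causal_mask(3))

for row in attn:
    print([round(w, 3) for w in row])
# [1.0, 0.0, 0.0]
# [0.5, 0.5, 0.0]
# [0.333, 0.333, 0.333]
```

Each row sums to 1, and no weight ever lands on a position to the right of the diagonal, so the prediction at position i cannot depend on later tokens.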
The Scaling Laws
A landmark 2020 paper (Kaplan et al.) showed that a language model's test loss falls as a power law in each of three factors, provided the other two are not the bottleneck:
- N — number of parameters
- D — dataset size (tokens)
- C — compute (FLOPs)
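To make "power law" concrete, here is a small sketch of the single-factor form L(x) = (x_c / x)^α applied to the parameter count N. The constants `ALPHA_N` and `N_C` are approximate values for non-embedding parameters as reported by Kaplan et al. and should be treated as illustrative assumptions; `power_law_loss` is a hypothetical helper name.

```python
def power_law_loss(x, x_c, alpha):
    """Single-factor scaling form: L(x) = (x_c / x) ** alpha."""
    return (x_c / x) ** alpha

# Assumption: approximate constants for parameter scaling from the paper.
ALPHA_N = 0.076
N_C = 8.8e13

small = power_law_loss(1e8, N_C, ALPHA_N)  # 100M-parameter model
large = power_law_loss(1e9, N_C, ALPHA_N)  # 1B-parameter model

# The ratio depends only on the 10x size increase, not on N_C.
ratio = large / small
print(round(ratio, 2))  # about 0.84
```

Under these assumed constants, a tenfold increase in parameters lowers the loss by roughly 16%, and the same relative gain repeats for every further factor of ten. That regularity is what makes performance predictable before training a larger model.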
The GPT family traces this progression in scale:
| Model | Parameters | Context | Data |
|---|---|---|---|
| GPT-1 (2018) | 117M | 512 tokens | BooksCorpus |
| GPT-2 (2019) | 1.5B | 1024 tokens | WebText (40GB) |
| GPT-3 (2020) | 175B | 2048 tokens | Filtered Common Crawl (~570GB) plus WebText2, books, and Wikipedia |
| GPT-4 (2023) | Undisclosed (~1T is an unofficial estimate) | 8K/32K tokens at launch; 128K in GPT-4 Turbo | Undisclosed (accepts text and image input) |
From Language Model to Assistant
Raw GPT generates text by sampling the next token. Turning it into a helpful assistant requires additional steps:
- Supervised Fine-Tuning (SFT): fine-tune on high-quality demonstrations of helpful responses.
- Reinforcement Learning from Human Feedback (RLHF): train a reward model from human preferences, then use PPO to optimise the policy.
This pipeline was introduced with “InstructGPT” (2022), and ChatGPT was trained with a closely related method.
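The reward-model step of RLHF can be illustrated with the pairwise comparison loss commonly used for it: given a response humans preferred and one they rejected, the loss is −log σ(r_chosen − r_rejected). The sketch below assumes the rewards are already scalar numbers; in practice they come from a neural reward model, and the PPO stage is not shown. `pairwise_preference_loss` is a hypothetical helper name.

```python
import math

def pairwise_preference_loss(reward_chosen, reward_rejected):
    """-log(sigmoid(r_chosen - r_rejected)), written as log(1 + exp(-margin))."""
    margin = reward_chosen - reward_rejected
    return math.log1p(math.exp(-margin))

# Made-up scalar rewards for a (chosen, rejected) pair of responses.
agrees = pairwise_preference_loss(2.0, 0.0)     # reward model ranks them correctly
undecided = pairwise_preference_loss(0.0, 0.0)  # no preference: loss is log(2)
disagrees = pairwise_preference_loss(0.0, 2.0)  # ranks them the wrong way round

print(round(agrees, 4), round(undecided, 4), round(disagrees, 4))
```

Minimising this loss pushes the reward of the preferred response above that of the rejected one. The trained reward model then supplies the signal that PPO optimises the policy against.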
✅ Key Takeaways
- GPT is a decoder-only Transformer with causal (left-to-right) attention masking.
- Trained on the simplest objective: predict the next token. Everything else emerges from scale.
- Performance improves predictably with more parameters, data, and compute (scaling laws).
- InstructGPT and ChatGPT extend the base model with SFT + RLHF so that it follows human instructions.
