GPT: Autoregressive Language Modelling at Scale

TL;DR: The GPT family uses a decoder-only Transformer trained with a simple objective: predict the next token. Masking prevents looking ahead. At sufficient scale, this objective leads to models that can write, reason, code, and follow instructions.

One Task, Massive Scale

GPT (OpenAI, 2018) is conceptually simpler than BERT: predict the next token from all previous tokens. This is the causal language modelling objective.

Input:  "The cat sat on the"
Target: predict "mat"

Repeat for every position in every document. With enough data and scale, the model learns grammar, facts, reasoning, and style purely as a side-effect of becoming good at this one task.
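To make "repeat for every position" concrete, here is a minimal NumPy sketch of how one sequence yields a training target at every position, scored with cross-entropy. The toy whitespace vocabulary, the helper names `next_token_pairs` and `cross_entropy`, and the uniform logits standing in for a model are all assumptions for illustration; real GPT models use subword tokenizers and a Transformer to produce the logits.

```python
import numpy as np

def next_token_pairs(token_ids):
    """Shift by one: inputs are tokens 0..n-2, targets are tokens 1..n-1."""
    ids = np.asarray(token_ids)
    return ids[:-1], ids[1:]

def cross_entropy(logits, targets):
    """Mean negative log-likelihood of targets under softmax(logits).

    logits: (seq_len, vocab_size); targets: (seq_len,) integer ids.
    """
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

# Toy vocabulary (hypothetical); a real model uses a subword tokenizer.
vocab = {"The": 0, "cat": 1, "sat": 2, "on": 3, "the": 4, "mat": 5}
tokens = [vocab[w] for w in "The cat sat on the mat".split()]
inputs, targets = next_token_pairs(tokens)

# Stand-in for model output: uniform logits over the vocabulary.
logits = np.zeros((len(inputs), len(vocab)))
loss = cross_entropy(logits, targets)
print(inputs, targets, loss)
```

With uniform logits the loss equals the log of the vocabulary size, which is the baseline an untrained model starts near; training lowers it by putting more probability on the actual next token.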

The Architecture: Decoder-Only

GPT uses only the decoder half of the original Transformer. The key difference from BERT: causal (or masked) self-attention.

In causal attention, each token can only attend to previous tokens (and itself). This is enforced with a mask applied to the attention scores before softmax: entries above the diagonal (future positions) are set to −∞, so their softmax weights become exactly zero. The region that remains visible is lower triangular.

[Figure: causal attention mask (lower triangular) for the tokens "The cat sat on the". Rows and columns are both labelled with the five tokens; cells on or below the diagonal can attend, and the 10 cells above the diagonal are masked (−∞).]
Figure 1: Causal masking in GPT. Token at position t can only attend to positions 0, 1, ..., t. Future positions are masked out to prevent information leakage.
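The masking step can be sketched in a few lines of NumPy. This is an illustrative single-head version, not GPT's actual implementation: the function name `causal_attention_weights` is hypothetical, and random scores stand in for the scaled query-key products.

```python
import numpy as np

def causal_attention_weights(scores):
    """Mask future positions in (seq_len, seq_len) scores, then softmax each row."""
    seq_len = scores.shape[-1]
    # True strictly above the diagonal: key position j > query position i (the future).
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    masked = np.where(future, -np.inf, scores)
    # The diagonal is never masked, so each row maximum is finite.
    masked = masked - masked.max(axis=-1, keepdims=True)
    weights = np.exp(masked)  # exp(-inf) is exactly 0
    return weights / weights.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
scores = rng.normal(size=(5, 5))  # stand-in for Q @ K.T / sqrt(d_k)
weights = causal_attention_weights(scores)
print(np.round(weights, 2))
```

Each row still sums to 1, the first token attends only to itself, and every entry above the diagonal is zero, which is the lower-triangular pattern the figure describes.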

The Scaling Laws

A landmark 2020 paper (Kaplan et al.) showed that language model test loss falls as a power law in each of three factors, as long as the other two are not the bottleneck:

  • N — number of parameters
  • D — dataset size (tokens)
  • C — compute (FLOPs)
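As a rough illustration of what a power law implies, the sketch below evaluates a single-factor law of the form L(x) = (x_c / x)^alpha. The function name and the constants `ALPHA_N` and `N_C` are assumptions chosen for illustration, not values taken from this post; the point is the shape of the curve, not the exact numbers.

```python
def power_law_loss(x, x_c, alpha):
    """Loss predicted by a single-factor power law L(x) = (x_c / x) ** alpha."""
    return (x_c / x) ** alpha

# Illustrative constants only (assumed, not quoted from the text above).
ALPHA_N = 0.076
N_C = 8.8e13

loss_small = power_law_loss(1.5e9, N_C, ALPHA_N)   # 1.5B parameters
loss_large = power_law_loss(1.5e10, N_C, ALPHA_N)  # 10x more parameters
ratio = loss_large / loss_small  # equals 10 ** -ALPHA_N, independent of N_C
print(loss_small, loss_large, ratio)
```

Under these assumed constants, every 10x increase in parameters multiplies the loss by the same factor (about 0.84 here). That constant ratio per decade is what makes the gains predictable, and the small exponent is why each further improvement needs so much more scale.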

The GPT family illustrates this push toward scale (GPT-1 and GPT-2 predate the paper):

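Model          Parameters               Context                                  Data
GPT-1 (2018)   117M                     512 tokens                               BooksCorpus
GPT-2 (2019)   1.5B                     1024 tokens                              WebText (40GB)
GPT-3 (2020)   175B                     2048 tokens                              Filtered Common Crawl (~570GB) plus smaller curated sets
GPT-4 (2023)   ~1T (unofficial est.)    8K/32K at launch; 128K in GPT-4 Turbo    Undisclosed (multimodal model)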

From Language Model to Assistant

Raw GPT generates text by sampling the next token. Turning it into a helpful assistant requires additional steps:

  1. Supervised Fine-Tuning (SFT): fine-tune on high-quality demonstrations of helpful responses.
  2. Reinforcement Learning from Human Feedback (RLHF): train a reward model from human preferences, then use PPO to optimise the policy.

This pipeline was introduced with InstructGPT (2022) and is the basis for ChatGPT.
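The reward-model step in RLHF is commonly trained on pairs of responses where a human preferred one over the other. A minimal sketch of the usual pairwise loss follows; the post does not specify the loss, so treat this form as an assumption, and note that the function name `preference_loss` and the reward values are hypothetical.

```python
import numpy as np

def preference_loss(reward_chosen, reward_rejected):
    """Pairwise loss -log(sigmoid(r_chosen - r_rejected)), averaged over pairs."""
    diff = np.asarray(reward_chosen, dtype=float) - np.asarray(reward_rejected, dtype=float)
    # -log(sigmoid(d)) = log(1 + exp(-d)), computed stably with logaddexp.
    return np.logaddexp(0.0, -diff).mean()

# Hypothetical scalar rewards for three (preferred, rejected) response pairs.
chosen = [1.2, 0.3, 2.0]
rejected = [0.4, 0.9, -1.0]
loss = preference_loss(chosen, rejected)

# When both responses get the same reward, the loss is log(2).
tie_loss = preference_loss([0.0], [0.0])
print(loss, tie_loss)
```

The loss shrinks as the reward model scores the preferred response higher than the rejected one. The trained reward model then supplies the scalar signal that PPO optimises the policy against.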

✅ Key Takeaways

  • GPT is a decoder-only Transformer with causal (left-to-right) attention masking.
  • Trained on the simplest objective: predict the next token. Everything else emerges from scale.
  • Performance improves predictably with more parameters, data, and compute (scaling laws).
  • InstructGPT/ChatGPT extends the base model with SFT + RLHF to follow human instructions.