GPT: Autoregressive Language Modelling at Scale


Published: January 11, 2024

TL;DR: The GPT family uses a decoder-only Transformer trained with a simple objective: predict the next token. Masking prevents looking ahead. At sufficient scale, this objective leads to models that can write, reason, code, and follow instructions.

One Task, Massive Scale

GPT (OpenAI, 2018) has a conceptually simpler training objective than BERT: predict the next token given all previous tokens. This is the causal language modelling objective.

Input:  "The cat sat on the"
Target: predict "mat"

Repeat for every position in every document. With enough data and scale, the model learns grammar, facts, reasoning, and style purely as a side-effect of becoming good at this one task.
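To make "repeat for every position" concrete, here is a minimal sketch of how one sentence turns into several training examples, and how the loss for one of them is computed. It splits on whole words for readability (real GPT models use subword tokens), and the helper names and the probability table are made up for illustration.

```python
import math

def next_token_pairs(tokens):
    """Return one (context, target) pair per position after the first."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

def cross_entropy(probs, target):
    """Negative log-probability assigned to the true next token."""
    return -math.log(probs[target])

tokens = ["The", "cat", "sat", "on", "the", "mat"]
pairs = next_token_pairs(tokens)

# The last pair is the example from the text:
# context "The cat sat on the", target "mat".
context, target = pairs[-1]

# Hypothetical model output for that context (made-up numbers).
probs = {"mat": 0.6, "sofa": 0.3, "moon": 0.1}
loss = cross_entropy(probs, target)
```

Training minimises the average of this loss over all positions, so a six-token sentence contributes five prediction problems rather than one.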

The Architecture: Decoder-Only

GPT uses only the decoder half of the original Transformer. The key difference from BERT: causal (or masked) self-attention.

In causal attention, each token can attend only to itself and to earlier tokens. This is enforced with a mask applied to the attention scores before the softmax: entries above the diagonal (the future positions) are set to −∞, so their attention weights are exactly zero after the softmax.

Figure 1: Causal masking in GPT. Token at position t can only attend to positions 0, 1, ..., t. Future positions are masked out to prevent information leakage.
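The masking step can be sketched in a few lines of plain Python. This is an illustration of the idea only, not how GPT is implemented (real implementations use batched tensor operations), and the function name and the uniform scores are assumptions chosen to make the result easy to read.

```python
import math

def causal_attention_weights(scores):
    """Row-wise softmax over scores after masking future positions.

    scores[t][s] is the raw score of query position t for key position s.
    """
    n = len(scores)
    weights = []
    for t in range(n):
        # Positions s > t lie above the diagonal: replace them with -inf.
        masked = [scores[t][s] if s <= t else -math.inf for s in range(n)]
        row_max = max(masked)  # finite, because s == t is never masked
        exps = [math.exp(x - row_max) for x in masked]  # exp(-inf) is 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Hypothetical 4-token sequence in which every raw score is equal.
scores = [[0.0] * 4 for _ in range(4)]
weights = causal_attention_weights(scores)
```

With equal scores, position t spreads its attention evenly over the t + 1 positions it is allowed to see, and every future position gets a weight of exactly zero.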

The Scaling Laws

A landmark 2020 paper (Kaplan et al.) showed that language model test loss falls as a power law in each of three factors, provided the other two are not the bottleneck:

N — number of parameters
D — dataset size (tokens)
C — compute (FLOPs)
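A power law means that loss depends on a resource x as (x_c / x) raised to a small exponent. The sketch below shows what that implies for parameter count. The two constants are approximately the fit reported by Kaplan et al. for N and should be treated as illustrative rather than exact; the function name is an assumption.

```python
def power_law_loss(x, x_c, alpha):
    """L(x) = (x_c / x) ** alpha: loss as a power law in one resource x."""
    return (x_c / x) ** alpha

# Approximate fit for parameter count, as reported by Kaplan et al. (2020).
# Illustrative values only; check the paper before relying on them.
ALPHA_N = 0.076
N_C = 8.8e13

loss_small = power_law_loss(1.5e9, N_C, ALPHA_N)
loss_large = power_law_loss(3.0e9, N_C, ALPHA_N)

# Doubling N multiplies the loss by 2 ** -ALPHA_N, whatever N_C is.
ratio = loss_large / loss_small
```

With an exponent this small, each doubling of parameters lowers the loss by only about 5%, which is why large gains required growing models by orders of magnitude.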

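The GPT family follows this trajectory of increasing scale. GPT-1 and GPT-2 predate the paper, but the trend it formalised is already visible: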

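Model	Parameters	Context	Training data
GPT-1 (2018)	117M	512 tokens	BooksCorpus
GPT-2 (2019)	1.5B	1024 tokens	WebText (40GB)
GPT-3 (2020)	175B	2048 tokens	Filtered Common Crawl (~570GB) plus WebText2, books, Wikipedia
GPT-4 (2023)	Undisclosed (~1T is an unofficial estimate)	8K / 32K tokens at launch (128K in GPT-4 Turbo)	Undisclosed; accepts image and text input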

From Language Model to Assistant

Raw GPT generates text by sampling the next token. Turning it into a helpful assistant requires additional steps:

Supervised Fine-Tuning (SFT): fine-tune on high-quality demonstrations of helpful responses.
Reinforcement Learning from Human Feedback (RLHF): train a reward model from human preferences, then use PPO to optimise the policy.

This is the pipeline introduced with InstructGPT; ChatGPT was trained with a closely related recipe.
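The reward-model step of RLHF can be illustrated with the pairwise preference loss commonly used for it: given a response that human labellers preferred and one they rejected, the reward model is penalised when it does not score the preferred one higher. This is a minimal sketch with scalar scores; the function name and the example values are assumptions, and the PPO stage that follows is not shown.

```python
import math

def pairwise_reward_loss(r_chosen, r_rejected):
    """-log(sigmoid(r_chosen - r_rejected)), computed in a stable form."""
    diff = r_chosen - r_rejected
    if diff >= 0:
        # -log(sigmoid(d)) = log(1 + exp(-d)); exp(-d) <= 1 here.
        return math.log1p(math.exp(-diff))
    # For negative d, rewrite as -d + log(1 + exp(d)) to avoid overflow.
    return -diff + math.log1p(math.exp(diff))

# Hypothetical reward-model scores for two responses to the same prompt.
loss_agrees = pairwise_reward_loss(2.0, 0.0)     # model ranks them like the labeller
loss_disagrees = pairwise_reward_loss(0.0, 2.0)  # model ranks them the other way
```

The loss is small when the reward model agrees with the human ranking and grows roughly linearly with the margin when it disagrees, which pushes the scores of preferred responses upward during training.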

✅ Key Takeaways

GPT is a decoder-only Transformer with causal (left-to-right) attention masking.
Trained on the simplest objective: predict the next token. Everything else emerges from scale.
Performance improves predictably with more parameters, data, and compute (scaling laws).
InstructGPT and ChatGPT extend the base model with SFT and RLHF so that it follows human instructions.


Alessio Borgi
