Transformers: The Architecture That Changed AI

TL;DR: The Transformer dropped sequential processing in favour of parallel attention over all tokens at once. This shift underpins GPT, BERT, Whisper, AlphaFold, ViT, and much of modern AI.

The Problem with the Old Way

Before 2017, the go-to model for text was the Recurrent Neural Network (RNN). It worked like a conveyor belt: read one word, update a hidden state, pass it to the next word. The trouble is that by the time you reach the end of a long sentence, the beginning is already fading — the network forgets.

This is closely tied to the vanishing gradient problem: as gradients are propagated back through many time steps they shrink towards zero, so far-back positions barely influence what the model learns. Researchers mitigated it with LSTMs and GRUs, but a fundamental bottleneck remained: you can’t parallelise a sequential process across time steps. Training was slow, and long-range dependencies were still hard to capture.
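To make the bottleneck concrete, here is a minimal sketch of a plain RNN forward pass in NumPy. The function name, the sizes, and the random weights are illustrative assumptions, not taken from any particular library or paper. The point is the loop: step t cannot start until step t-1 has produced its hidden state.

```python
import numpy as np

def rnn_forward(inputs, W_x, W_h, b):
    """Run a plain tanh RNN over a sequence, one time step at a time."""
    h = np.zeros(W_h.shape[0])            # initial hidden state
    states = []
    for x_t in inputs:                    # each step needs h from the previous step
        h = np.tanh(W_x @ x_t + W_h @ h + b)
        states.append(h)
    return np.stack(states)               # (seq_len, d_hidden)

rng = np.random.default_rng(0)
seq_len, d_in, d_hidden = 6, 4, 8         # illustrative sizes (assumed)
inputs = rng.normal(size=(seq_len, d_in))
W_x = rng.normal(scale=0.5, size=(d_hidden, d_in))
W_h = rng.normal(scale=0.5, size=(d_hidden, d_hidden))
b = np.zeros(d_hidden)

states = rnn_forward(inputs, W_x, W_h, b)
print(states.shape)  # (6, 8)
```

Everything the last step knows about the first token has to pass through every intermediate hidden state, which is where long-range information gets diluted.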

The Core Insight: Attend to Everything

The 2017 paper Attention Is All You Need (Vaswani et al.) asked: what if you let every word look directly at every other word, with no chain of intermediate steps in between?

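That’s self-attention. Each token computes a score against every token in the sequence (itself included), turns those scores into weights that say which tokens are relevant, and mixes their information together in one parallel step. No sequential dependency, and no single hidden state that distant information has to survive in.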
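As a rough illustration of "every token looks at every other token in one step", the sketch below scores three made-up token vectors against each other with plain dot products. It leaves out the learned projections a real Transformer uses, so treat it as a simplification under that assumption.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))               # 3 tokens ("the", "cat", "sat"), 4-dim vectors (made up)

scores = X @ X.T / np.sqrt(X.shape[1])    # (3, 3): every token scored against every token
weights = softmax(scores, axis=-1)        # each row becomes a distribution over tokens
mixed = weights @ X                       # each token is now a weighted mix of all tokens

print(weights.shape, mixed.shape)  # (3, 3) (3, 4)
```

There is no loop over positions: the whole score matrix comes out of one matrix multiplication, which is what makes the computation easy to parallelise.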

[Figure 1 panels. Left, RNN (sequential): "the cat sat" must be processed one-by-one. Right, Transformer (parallel): all tokens attend to each other at once. Bottom, Transformer Encoder Stack: Input Tokens + Embedding, Positional Encoding, then × N layers of (Multi-Head Self-Attention, Add & Norm, Feed-Forward Network, Add & Norm), Output Representations.]
Figure 1: RNNs process tokens sequentially (left); Transformers attend to all tokens in parallel (right). The bottom shows the encoder stack.

Architecture Walk-Through

A Transformer encoder is built from the blocks below. Steps 1 and 2 are applied once at the input; steps 3 to 5 form a layer that is stacked N times:

1. Token Embedding

Each word (or subword token) is mapped to a dense vector — a point in high-dimensional space where similar words land close together.
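In code, an embedding is just a table lookup. The sketch below uses a toy vocabulary and a randomly initialised table, both hypothetical. In a trained model the table is learned, and only then do similar words end up close together.

```python
import numpy as np

vocab = {"the": 0, "cat": 1, "sat": 2, "<unk>": 3}   # toy vocabulary (hypothetical)
d_model = 8                                          # embedding width (assumed)

rng = np.random.default_rng(0)
# One row per token id; random here, learned during training in a real model.
embedding_table = rng.normal(scale=0.02, size=(len(vocab), d_model))

def embed(tokens):
    """Map a list of token strings to their embedding vectors."""
    ids = np.array([vocab.get(t, vocab["<unk>"]) for t in tokens])
    return embedding_table[ids]                      # (len(tokens), d_model)

x = embed(["the", "cat", "sat"])
print(x.shape)  # (3, 8)
```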

2. Positional Encoding

Because attention sees all tokens simultaneously, the model would otherwise have no idea which word comes first. Positional encodings inject position information into each token’s vector before it enters the attention layers. (See the dedicated PE posts for all the variants.)
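One well-known variant is the fixed sinusoidal encoding from the original paper, sketched here in NumPy. The function name and the sizes are illustrative assumptions. Learned or rotary encodings are built differently and are not shown.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Fixed sin/cos encoding: even dims use sin, odd dims use cos."""
    assert d_model % 2 == 0, "this sketch assumes an even d_model"
    positions = np.arange(seq_len)[:, None]              # (seq_len, 1)
    even_dims = np.arange(0, d_model, 2)[None, :]        # (1, d_model / 2)
    angles = positions / np.power(10000.0, even_dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_positional_encoding(seq_len=10, d_model=16)
print(pe.shape)  # (10, 16)

# The encoding is added to the token embeddings before the first layer:
# x = token_embeddings + pe[:len(token_embeddings)]
```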

3. Multi-Head Self-Attention

This is the heart of the Transformer. Each token computes three vectors: a Query (what I’m looking for), a Key (what I offer), and a Value (what I’ll contribute). The model computes pairwise relevance scores as scaled dot products between queries and keys, normalises them with a softmax, then mixes the value vectors according to the resulting weights. Running this process in parallel across h heads, each with its own learned projections, lets the model capture different types of relationships simultaneously.
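The Query/Key/Value recipe can be sketched in a few lines of NumPy. The weight matrices are random stand-ins for learned parameters, and the names (W_q, W_k, W_v, W_o, num_heads) are assumptions chosen for this example rather than any library's API.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(X, W_q, W_k, W_v, W_o, num_heads):
    """Scaled dot-product self-attention with num_heads heads (no masking)."""
    seq_len, d_model = X.shape
    d_head = d_model // num_heads

    def split_heads(M):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return M.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split_heads(X @ W_q), split_heads(X @ W_k), split_heads(X @ W_v)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)   # (heads, seq, seq)
    weights = softmax(scores, axis=-1)                    # rows sum to 1
    heads = weights @ V                                   # (heads, seq, d_head)
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o, weights                          # output projection

rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 5, 16, 4                    # illustrative sizes
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v, W_o = (rng.normal(scale=0.1, size=(d_model, d_model)) for _ in range(4))

out, weights = multi_head_self_attention(X, W_q, W_k, W_v, W_o, num_heads)
print(out.shape, weights.shape)  # (5, 16) (4, 5, 5)
```

Each head works on its own d_head-sized slice of the projections, so the four heads here produce four separate 5×5 attention patterns before their outputs are concatenated.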

4. Add & Layer Norm

A residual connection adds the attention output back to the input, then layer normalisation stabilises training. This pattern repeats after every sub-layer and is crucial for training deep stacks.
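A minimal sketch of the Add & Norm step, assuming the post-norm ordering described above (add first, then normalise). The names gamma and beta for the learned scale and shift are assumptions for this example; here they are fixed to ones and zeros.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalise each token vector to zero mean and unit variance, then scale and shift."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def add_and_norm(x, sublayer_out, gamma, beta):
    """Residual connection followed by layer normalisation."""
    return layer_norm(x + sublayer_out, gamma, beta)

rng = np.random.default_rng(0)
seq_len, d_model = 5, 16                          # illustrative sizes
x = rng.normal(size=(seq_len, d_model))           # sub-layer input
sublayer_out = rng.normal(size=(seq_len, d_model))  # stand-in for attention output
gamma, beta = np.ones(d_model), np.zeros(d_model)

y = add_and_norm(x, sublayer_out, gamma, beta)
print(y.shape)  # (5, 16)
```

The residual path gives gradients a direct route around each sub-layer, which is one reason deep stacks remain trainable.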

5. Feed-Forward Network

Two linear layers with a non-linearity (typically GELU or ReLU) applied independently to each token position. This is where the model “thinks” about each token after mixing information via attention.
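The feed-forward block is small enough to sketch directly. The hidden width of 4 × d_model is a common convention assumed here, the GELU uses the tanh approximation, and the weights are random placeholders for learned parameters.

```python
import numpy as np

def gelu(x):
    """Tanh approximation of the GELU activation."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def feed_forward(x, W1, b1, W2, b2):
    """Expand, apply the non-linearity, project back; rows are handled independently."""
    return gelu(x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
seq_len, d_model = 5, 16
d_ff = 4 * d_model                                # assumed hidden width
x = rng.normal(size=(seq_len, d_model))
W1, b1 = rng.normal(scale=0.1, size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(scale=0.1, size=(d_ff, d_model)), np.zeros(d_model)

out = feed_forward(x, W1, b1, W2, b2)
print(out.shape)  # (5, 16)
```

Because the same weights are applied to every row, feeding a single token through the block gives the same result as that token's row in the full output. No information moves between positions here; that is attention's job.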

Where Transformers Are Used Today

Domain      | Model           | What it does
Language    | GPT-4, LLaMA 3  | Generate and understand text
Language    | BERT, RoBERTa   | Classify, extract, embed text
Vision      | ViT, Swin       | Classify and segment images
Audio       | Whisper         | Transcribe speech
Biology     | AlphaFold 2     | Predict protein structure
Multi-modal | CLIP, Gemini    | Connect text + images

Encoders, Decoders, and Hybrids

  • Encoder-only (BERT): reads the full sequence bidirectionally; great for understanding tasks.
  • Decoder-only (GPT): reads left-to-right and predicts the next token; great for generation.
  • Encoder–Decoder (T5, original Transformer): encodes a source sequence, then decodes a target; great for translation and summarisation.
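In terms of the attention computation, the main difference between the encoder-style and decoder-style variants above is a mask. The sketch below is a simplified illustration with random scores and a hypothetical helper name. It shows bidirectional versus causal (left-to-right) attention weights; the cross-attention that an encoder-decoder model adds is not shown.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_weights(scores, causal=False):
    """Softmax over scores; with causal=True, each token ignores later positions."""
    if causal:
        seq_len = scores.shape[-1]
        # True strictly above the diagonal, i.e. at "future" positions.
        future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    return softmax(scores, axis=-1)

rng = np.random.default_rng(0)
scores = rng.normal(size=(4, 4))                  # made-up pairwise scores for 4 tokens

bidirectional = attention_weights(scores)         # encoder-style: all positions visible
causal = attention_weights(scores, causal=True)   # decoder-style: past and self only

print(np.round(causal, 2))  # zeros above the diagonal
```

With the causal mask, the first token can only attend to itself, so its row of weights puts all of its mass on position 0.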

✅ Key Takeaways

  • Transformers replaced sequential RNNs with parallel self-attention.
  • Each encoder layer has two sub-layers: multi-head self-attention and a feed-forward network, each wrapped in a residual connection followed by layer normalisation.
  • Positional encodings compensate for the order-agnostic nature of attention.
  • The same architecture works across text, images, audio, and biology by changing inputs and objectives.