Transformers: The Architecture That Changed AI
The Problem with the Old Way
Before 2017, the go-to model for text was the Recurrent Neural Network (RNN). It worked like a conveyor belt: read one word, update a hidden state, pass it to the next word. The trouble is that by the time you reach the end of a long sentence, the beginning is already fading — the network forgets.
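This is closely tied to the vanishing gradient problem: as gradients are propagated back through many time steps they shrink, so words far back in the sequence barely influence what the model learns. Researchers mitigated it with LSTMs and GRUs, but a fundamental bottleneck remained: each step depends on the previous one, so the computation can’t be parallelised across the sequence. Training was slow, and long-range dependencies stayed hard to capture.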
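To make the sequential bottleneck concrete, here is a minimal sketch of a plain RNN forward pass in NumPy. The sizes, weight names (`W_x`, `W_h`), and random inputs are illustrative assumptions, not taken from any particular model.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden, seq_len = 4, 8, 6  # illustrative sizes (assumed)

W_x = rng.normal(scale=0.1, size=(d_in, d_hidden))      # input-to-hidden weights
W_h = rng.normal(scale=0.1, size=(d_hidden, d_hidden))  # hidden-to-hidden weights
inputs = rng.normal(size=(seq_len, d_in))               # one vector per word

h = np.zeros(d_hidden)  # hidden state before the first word
states = []
for x_t in inputs:
    # Step t needs the hidden state from step t-1,
    # so the iterations of this loop cannot run in parallel.
    h = np.tanh(x_t @ W_x + h @ W_h)
    states.append(h)

states = np.stack(states)  # shape: (seq_len, d_hidden)
```

Everything the model knows about the first word has to survive every later update of `h`, which is why distant words tend to fade.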
The Core Insight: Attend to Everything
The 2017 paper Attention Is All You Need (Vaswani et al.) asked: what if you let every word look directly at every other word, with no intermediate steps in between?
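That’s self-attention. Each token computes a score with every other token, and the model learns to turn those scores into weights that say which tokens are relevant. Each token then mixes in the others’ information according to those weights, all in one parallel step. There is no sequential dependency, and any two tokens are connected directly no matter how far apart they are.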
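A stripped-down sketch of that idea follows. It deliberately omits the learned projections a real Transformer uses and scores tokens by the dot product of their raw vectors, which is a simplifying assumption made here for illustration; the token vectors are random placeholders.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
tokens = rng.normal(size=(5, 16))  # 5 tokens, 16 dimensions each (assumed sizes)

# Every token is scored against every other token in one matrix product.
scores = tokens @ tokens.T / np.sqrt(tokens.shape[-1])  # shape: (5, 5)
weights = softmax(scores, axis=-1)                      # each row sums to 1
mixed = weights @ tokens                                # shape: (5, 16)
```

Row `i` of `weights` says how much token `i` draws from each token in the sequence, and there is no loop over positions.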
Architecture Walk-Through
A Transformer encoder consists of these building blocks. Steps 1 and 2 are applied once to the input; steps 3 to 5 form a layer that is stacked N times:
1. Token Embedding
Each word (or subword token) is mapped to a dense vector — a point in high-dimensional space where similar words land close together.
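In code, an embedding is just a table lookup. The sketch below uses a hypothetical four-word vocabulary and a randomly initialised table; in a trained model the table values are learned, and the tokeniser is far more elaborate.

```python
import numpy as np

vocab = {"the": 0, "cat": 1, "sat": 2, "<unk>": 3}  # hypothetical toy vocabulary
d_model = 8                                         # assumed embedding size

rng = np.random.default_rng(0)
# One row per token id; random here, learned during training in practice.
embedding_table = rng.normal(size=(len(vocab), d_model))

def embed(words):
    """Map words to ids (unknown words fall back to <unk>), then look up rows."""
    ids = np.array([vocab.get(w, vocab["<unk>"]) for w in words])
    return embedding_table[ids]

x = embed(["the", "cat", "sat"])  # shape: (3, d_model)
```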
2. Positional Encoding
Because attention sees all tokens simultaneously, the model would otherwise have no idea which word comes first. Positional encodings inject position information into each token’s vector before it enters the attention layers. (See the dedicated PE posts for all the variants.)
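One common variant is the fixed sinusoidal encoding from the original paper, sketched below. This is only one of several schemes, and the function name and the even `d_model` requirement are assumptions of this sketch.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Fixed sine/cosine encoding; assumes d_model is even."""
    positions = np.arange(seq_len)[:, None]       # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]      # even dimension indices 2i
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)  # odd dimensions use cosine
    return pe

pe = sinusoidal_positional_encoding(seq_len=6, d_model=8)
```

The encoding is added element-wise to the token embeddings, so two identical words at different positions enter the attention layers as different vectors.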
3. Multi-Head Self-Attention
This is the heart of the Transformer. Each token is projected into three vectors: a Query (what I’m looking for), a Key (what I offer), and a Value (what I’ll contribute). The model computes pairwise relevance scores as scaled dot products between queries and keys, normalises them with a softmax, then mixes the value vectors accordingly. Running this process in parallel across h heads, each with its own projections, lets the model capture different types of relationships simultaneously.
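The following is a minimal NumPy sketch of multi-head self-attention for a single sequence. The weight names (`W_q`, `W_k`, `W_v`, `W_o`), the random initialisation, and the sizes are assumptions for illustration; the output projection `W_o` follows the original paper rather than the description above, and batching, masking, and dropout are left out.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(x, W_q, W_k, W_v, W_o, num_heads):
    seq_len, d_model = x.shape
    d_head = d_model // num_heads  # assumes d_model divisible by num_heads

    def split_heads(t):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    q = split_heads(x @ W_q)  # queries
    k = split_heads(x @ W_k)  # keys
    v = split_heads(x @ W_v)  # values

    # Pairwise relevance scores per head, scaled by sqrt(d_head).
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (heads, seq, seq)
    weights = softmax(scores, axis=-1)
    heads = weights @ v                                  # (heads, seq, d_head)

    # Concatenate the heads and mix them with the output projection.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o

rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 4, 16, 4
x = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v, W_o = (
    rng.normal(scale=0.1, size=(d_model, d_model)) for _ in range(4)
)
out = multi_head_self_attention(x, W_q, W_k, W_v, W_o, num_heads)
```

Each head works on its own `d_head`-sized slice of the projected vectors, so the heads can attend to different tokens at no extra cost over one full-width head.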
4. Add & Layer Norm
A residual connection adds the attention output back to the input, then layer normalisation stabilises training. This pattern repeats after every sub-layer and is crucial for training deep stacks.
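A minimal sketch of this step, in the add-then-normalise order described above. The parameter names `gamma` and `beta` (learned scale and shift) and the `eps` value are assumptions of this sketch; some later models apply the normalisation before the sub-layer instead.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Normalise each token's vector to zero mean and unit variance,
    # then apply a learned scale (gamma) and shift (beta).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def add_and_norm(x, sublayer_out, gamma, beta):
    # Residual connection first, layer normalisation second.
    return layer_norm(x + sublayer_out, gamma, beta)

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 8))             # sub-layer input
sublayer_out = rng.normal(size=(3, 8))  # stand-in for attention output
y = add_and_norm(x, sublayer_out, gamma=np.ones(8), beta=np.zeros(8))
```

Because the input `x` is added back unchanged, gradients have a direct path through the stack even when a sub-layer's own contribution is small.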
5. Feed-Forward Network
Two linear layers with a non-linearity (typically GELU or ReLU) applied independently to each token position. This is where the model “thinks” about each token after mixing information via attention.
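A minimal sketch of the position-wise feed-forward network, using ReLU. The hidden width of four times `d_model` mirrors the ratio in the original paper but is an assumption here, as are the weight names and random values.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def feed_forward(x, W1, b1, W2, b2):
    # Expand to the hidden width, apply the non-linearity, project back.
    # Matrix multiplication acts on each row (token) independently.
    return relu(x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
seq_len, d_model = 5, 8
d_ff = 4 * d_model  # assumed hidden width

W1 = rng.normal(scale=0.1, size=(d_model, d_ff))
b1 = np.zeros(d_ff)
W2 = rng.normal(scale=0.1, size=(d_ff, d_model))
b2 = np.zeros(d_model)

x = rng.normal(size=(seq_len, d_model))
out = feed_forward(x, W1, b1, W2, b2)  # shape: (seq_len, d_model)
```

Running the network on a single token gives the same result as that token's row in the full output, which is what "applied independently to each token position" means in practice.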
Where Transformers Are Used Today
| Domain | Model | What it does |
|---|---|---|
| Language | GPT-4, LLaMA 3 | Generate and understand text |
| Language | BERT, RoBERTa | Classify, extract, embed text |
| Vision | ViT, Swin | Classify and segment images |
| Audio | Whisper | Transcribe speech |
| Biology | AlphaFold 2 | Predict protein structure |
| Multi-modal | CLIP, Gemini | Connect text + images |
Encoders, Decoders, and Hybrids
- Encoder-only (BERT): reads the full sequence bidirectionally; great for understanding tasks.
- Decoder-only (GPT): reads left-to-right and predicts the next token; great for generation.
- Encoder–Decoder (T5, original Transformer): encodes a source sequence, then decodes a target; great for translation and summarisation.
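At the level of the attention computation, the difference between bidirectional and left-to-right reading comes down to a mask on the scores. The sketch below assumes the usual approach of setting scores for future positions to negative infinity before the softmax; the function name and random scores are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_weights(scores, causal):
    """Turn raw scores into attention weights, optionally hiding the future."""
    if causal:
        seq_len = scores.shape[-1]
        # True above the diagonal: position i may not see positions j > i.
        future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    return softmax(scores, axis=-1)

rng = np.random.default_rng(0)
scores = rng.normal(size=(4, 4))  # stand-in for query-key scores

bidirectional = attention_weights(scores, causal=False)  # encoder-style
left_to_right = attention_weights(scores, causal=True)   # decoder-style
```

In `left_to_right`, every entry above the diagonal is zero, so each token only draws on itself and earlier tokens, which is what makes next-token prediction a fair task.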
✅ Key Takeaways
- Transformers replaced sequential RNNs with parallel self-attention.
- Each encoder layer has two sub-layers: multi-head attention and a feed-forward network, each followed by a residual connection and layer normalisation.
- Positional encodings compensate for the order-agnostic nature of attention.
- The same architecture works across text, images, audio, and biology by changing inputs and objectives.
