Transformers: The Architecture That Changed AI

Published: May 26, 2026

TL;DR: The Transformer dropped sequential processing in favour of parallel attention over all tokens at once. That shift underpins GPT, BERT, Whisper, AlphaFold, and ViT, and with them much of modern AI.

Paper: "Attention Is All You Need" · arXiv:1706.03762
Authors: A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin
Venue: NeurIPS 2017

Paper preview: first page of Attention Is All You Need (Vaswani et al., 2017).

Figure 1: The original Transformer encoder-decoder diagram is still the best high-level map of the architecture. Token embeddings and positional information enter stacked encoder and decoder blocks, while masked self-attention and cross-attention let generation stay autoregressive without losing access to the encoded source sequence. Source: [1].

Why it matters

Transformers are the common backbone behind LLMs, vision-language models, speech models, and a growing share of scientific AI.

Core mechanism

Every token can directly inspect every other token through self-attention, instead of waiting for information to travel step by step.

What to learn next

Attention, QKV, masking, residuals, and positional encodings are the five ideas that make the whole architecture click.

The Problem with the Old Way

Before 2017, the go-to model for text was the Recurrent Neural Network (RNN). It worked like a conveyor belt: read one word, update a hidden state, pass it to the next word. The trouble is that by the time you reach the end of a long sentence, the beginning is already fading — the network forgets.

This is closely related to the vanishing gradient problem: during training, the learning signal from far-back positions shrinks as it is propagated through many steps, so distant tokens barely influence the model. Researchers patched it with LSTMs and GRUs, but the fundamental bottleneck remained: you can’t parallelise a sequential process. Training was slow, and long-range dependencies were hard to capture.

Animated: information from "animal" fades as the RNN processes each subsequent step. Transformers solve this by attending to every token directly.
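To make the conveyor-belt picture concrete, here is a minimal sketch of a vanilla tanh RNN in NumPy. The function names, the dimensions, and the recurrent weights (a scaled identity matrix) are hypothetical choices made for illustration, picked so that every step shrinks older information. The loop shows why the computation cannot be parallelised, and the second function measures how much the final hidden state moves when only the first input is nudged.

```python
import numpy as np

def rnn_final_state(xs, W_x, W_h):
    """Run a vanilla tanh RNN over xs (seq_len, d) and return the last hidden state."""
    h = np.zeros(W_h.shape[0])
    for x_t in xs:
        # Step t needs the hidden state from step t-1, so the loop is inherently sequential.
        h = np.tanh(W_h @ h + W_x @ x_t)
    return h

def first_token_influence(seq_len, d=8, seed=0):
    """Norm of the change in the final state when only the first input is nudged."""
    rng = np.random.default_rng(seed)
    W_x = rng.normal(size=(d, d)) / np.sqrt(d)
    W_h = 0.5 * np.eye(d)  # assumed weights: each step halves (at least) what came before
    xs = rng.normal(size=(seq_len, d))
    nudged = xs.copy()
    nudged[0] += 0.1
    diff = rnn_final_state(nudged, W_x, W_h) - rnn_final_state(xs, W_x, W_h)
    return np.linalg.norm(diff)

short = first_token_influence(seq_len=1)
long = first_token_influence(seq_len=20)
print(f"influence after 1 step: {short:.3e}, after 20 steps: {long:.3e}")
```

With these assumed weights the influence of the first token shrinks by at least a factor of two per step, which is the fading effect described above. Real RNNs learn their weights, so the decay rate varies in practice.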

The real bottleneck: older sequence models were not only harder to train. They also forced information to move through many intermediate steps, which is exactly the wrong bias when meaning depends on distant words, long documents, or multi-modal context.

The Core Insight: Attend to Everything

The 2017 paper Attention Is All You Need (Vaswani et al.) asked: what if you let every word look directly at every other word, with no intermediate steps in between?

That’s self-attention. Each token computes a score with every other token, learns which ones are relevant, and mixes their information together — in one parallel step. No sequential dependency. No forgetting.

Figure 2: Scaled dot-product attention is the core computation inside the Transformer. Queries score keys, scaling keeps those scores numerically well behaved, softmax turns them into weights, and values are mixed accordingly. Source: [1].
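The pipeline in the figure corresponds to softmax(QKᵀ / √d_k) V from the paper. Below is a minimal NumPy sketch of that computation, assuming a single unbatched sequence. The function names and the random inputs are illustrative, not taken from any particular library.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum first for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v).

    Returns the mixed values (n_q, d_v) and the attention weights (n_q, n_k).
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # queries score keys, then scale
    weights = softmax(scores, axis=-1)   # each row becomes a distribution over keys
    return weights @ V, weights          # values are mixed according to the weights

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
out, weights = scaled_dot_product_attention(Q, K, V)
print(out.shape, weights.sum(axis=-1))   # every row of weights sums to 1
```

Every query is scored against every key in a single matrix product, which is what lets the whole step run in parallel.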

The Big Picture in One Pass

If you strip away the implementation details, a Transformer does five things:

Turn tokens into vectors.
Inject position information, because attention alone is order-agnostic.
Let tokens exchange information via self-attention.
Stabilise deep training with residual connections and layer normalisation.
Refine each token independently with a feed-forward network.

That recipe is simple enough to reuse across domains, which is why the same core architecture reappears in language, vision, audio, biology, robotics, and multi-modal systems.

The five-step Transformer pipeline, animated to highlight the sequential data flow. Steps 3–5 form the repeatable encoder block.
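Step 2 in the list above rests on the claim that attention alone is order-agnostic. A quick way to check this is to shuffle the input tokens and confirm that the outputs are shuffled in exactly the same way, with nothing else changing. The sketch below assumes single-head self-attention with random, hypothetical projection matrices.

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Single-head self-attention over X of shape (n_tokens, d_model)."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 16))
W_q, W_k, W_v = (rng.normal(size=(16, 16)) / 4.0 for _ in range(3))

perm = np.array([3, 0, 4, 1, 2])         # an arbitrary reordering of the five tokens
out = self_attention(X, W_q, W_k, W_v)
out_perm = self_attention(X[perm], W_q, W_k, W_v)

# Shuffling the inputs only shuffles the outputs; each token's result is unchanged.
same_up_to_reordering = np.allclose(out_perm, out[perm])
print(same_up_to_reordering)
```

Because each token's output does not depend on where it sits in the sequence, word order has to be supplied separately.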

Architecture Walk-Through

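A Transformer encoder consists of the building blocks below. Embedding and positional encoding (1 and 2) are applied once at the input; blocks 3 to 5 are then stacked N times: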

1. Token Embedding

Each word (or subword token) is mapped to a dense vector — a point in high-dimensional space where similar words land close together.
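In code, an embedding is just a table lookup. The sketch below uses a hypothetical four-word vocabulary and a random table. In a real model the table is learned, and that learning is what makes similar words land close together.

```python
import numpy as np

vocab = {"the": 0, "animal": 1, "was": 2, "tired": 3}      # hypothetical toy vocabulary
d_model = 8
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), d_model))   # random here, learned in practice

def embed(tokens):
    """Map a list of tokens to a (len(tokens), d_model) matrix by row lookup."""
    ids = np.array([vocab[t] for t in tokens])
    return embedding_table[ids]

X = embed(["the", "animal", "was", "tired"])
print(X.shape)
```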

2. Positional Encoding

Because attention sees all tokens simultaneously, the model would otherwise have no idea which word comes first. Positional encodings inject position information into each token’s vector before it enters the attention layers. (See the dedicated PE posts for all the variants.)
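As one concrete variant, the original paper uses fixed sinusoids: PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)). A minimal NumPy sketch follows, assuming an even d_model. The function name is illustrative.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Return a (seq_len, d_model) matrix of sinusoidal encodings (d_model assumed even)."""
    positions = np.arange(seq_len)[:, None]         # (seq_len, 1)
    even_dims = np.arange(0, d_model, 2)[None, :]   # the values 2i, shape (1, d_model / 2)
    angles = positions / np.power(10000.0, even_dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                    # even dimensions
    pe[:, 1::2] = np.cos(angles)                    # odd dimensions
    return pe

pe = sinusoidal_positional_encoding(seq_len=6, d_model=8)
print(pe.shape)
```

The result is added to the token embeddings, so two identical words at different positions enter the attention layers with different vectors.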

3. Multi-Head Self-Attention

This is the heart of the Transformer. Each token computes three vectors — a Query (what I’m looking for), a Key (what I offer), and a Value (what I’ll contribute). The model computes pairwise relevance scores, normalises them with a softmax, then mixes the value vectors accordingly. Running this process in parallel across h heads lets the model capture different types of relationships simultaneously.

Figure 3: Multi-head attention repeats the same attention computation in parallel with different learned projections. Afterward, the heads are concatenated and remixed through one final linear layer, which lets the model combine several relational views of the same sequence at once. Source: [1].
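The split, attend, concatenate, and remix steps can be sketched in a few lines of NumPy. This is a simplified version under stated assumptions: one unbatched sequence, no biases, no masking, and d_model divisible by the number of heads. The weight names W_q, W_k, W_v, and W_o are illustrative.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_self_attention(X, W_q, W_k, W_v, W_o, num_heads):
    """X: (n, d_model); every weight matrix: (d_model, d_model)."""
    n, d_model = X.shape
    d_head = d_model // num_heads

    def split_heads(M):
        # (n, d_model) -> (num_heads, n, d_head)
        return M.reshape(n, num_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split_heads(X @ W_q), split_heads(X @ W_k), split_heads(X @ W_v)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)     # (num_heads, n, n)
    heads = softmax(scores) @ V                              # (num_heads, n, d_head)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)    # concatenate the heads
    return concat @ W_o                                      # final linear remix

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 16))
W = [rng.normal(size=(16, 16)) / 4.0 for _ in range(4)]
out = multi_head_self_attention(X, *W, num_heads=4)
print(out.shape)
```

Each head works in a smaller d_head-dimensional subspace, so the total cost stays close to that of one full-width head.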

4. Add & Layer Norm

A residual connection adds the attention output back to the input, then layer normalisation stabilises training. This pattern repeats after every sub-layer and is crucial for training deep stacks.
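A minimal sketch of this step, assuming the arrangement described above (add first, then normalise, as in the original paper). Many later models normalise before the sub-layer instead. The gamma and beta parameters are the usual learned scale and shift, set to ones and zeros here for illustration.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalise each token vector (last axis) to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def add_and_norm(x, sublayer_out, gamma, beta):
    """Residual connection followed by layer normalisation."""
    return layer_norm(x + sublayer_out, gamma, beta)

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 16))
sublayer_out = rng.normal(size=(5, 16))   # stands in for the attention output
y = add_and_norm(x, sublayer_out, gamma=np.ones(16), beta=np.zeros(16))
print(y.mean(axis=-1).round(6), y.std(axis=-1).round(3))
```

The residual path gives gradients a direct route through the stack, while the normalisation keeps each token vector at a consistent scale from layer to layer.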

5. Feed-Forward Network

The feed-forward network is two linear layers with a non-linearity between them (ReLU in the original paper, often GELU in later models), applied independently to each token position. This is where the model “thinks” about each token after mixing information via attention.
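A minimal sketch of the position-wise feed-forward network, assuming ReLU and a hidden width of four times d_model (the ratio used in the original paper). The weights are random placeholders. The final check confirms that each position is processed on its own, with no mixing between tokens.

```python
import numpy as np

def feed_forward(X, W1, b1, W2, b2):
    """X: (n, d_model) -> (n, d_model), applied row by row."""
    hidden = np.maximum(0.0, X @ W1 + b1)   # expand to d_ff, then ReLU
    return hidden @ W2 + b2                 # project back to d_model

rng = np.random.default_rng(0)
d_model, d_ff = 8, 32
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

X = rng.normal(size=(5, d_model))
Y = feed_forward(X, W1, b1, W2, b2)

# Feeding one token alone gives the same result as feeding it within the sequence.
alone = feed_forward(X[2:3], W1, b1, W2, b2)[0]
print(np.allclose(Y[2], alone))
```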

Reading Roadmap

Start with Self-Attention and Scaled Dot-Product Attention.
Then learn QKV, attention masks, and multi-head attention.
After that, study positional encodings and their modern long-context variants.
Finally, zoom out with The Transformer Block to see how everything composes into one layer.

Where Transformers Are Used Today

Domain	Model	What it does
Language	GPT-4, LLaMA 3	Generate and understand text
Language	BERT, RoBERTa	Classify, extract, embed text
Vision	ViT, Swin	Classify and segment images
Audio	Whisper	Transcribe speech
Biology	AlphaFold 2	Predict protein structure
Multi-modal	CLIP, Gemini	Connect text + images

Encoders, Decoders, and Hybrids

Encoder-only (BERT): reads the full sequence bidirectionally; great for understanding tasks.
Decoder-only (GPT): reads left-to-right and predicts the next token; great for generation.
Encoder–Decoder (T5, original Transformer): encodes a source sequence, then decodes a target; great for translation and summarisation.
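In practice, the difference between "reads bidirectionally" and "reads left-to-right" comes down to an attention mask. The sketch below assumes uniform scores so that the mask alone determines the weights. The helper name is illustrative.

```python
import numpy as np

def attention_weights(scores, mask=None):
    """Softmax over the last axis; positions where mask is False get zero weight."""
    if mask is not None:
        scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(scores)
    return e / e.sum(axis=-1, keepdims=True)

n = 4
scores = np.zeros((n, n))                            # uniform scores for illustration
causal_mask = np.tril(np.ones((n, n), dtype=bool))   # token i may see tokens 0..i only

bidirectional = attention_weights(scores)            # encoder style: every token sees all
causal = attention_weights(scores, causal_mask)      # decoder style: no looking ahead

print(bidirectional[1])   # [0.25 0.25 0.25 0.25]
print(causal[1])          # [0.5 0.5 0.  0. ]
```

An encoder-decoder model uses both: unmasked attention in the encoder, causal attention in the decoder, plus cross-attention from decoder tokens to the encoder output.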

Why the Architecture Scaled So Well

Transformers won not because attention is mathematically elegant, but because the whole stack lines up with modern compute:

self-attention parallelises well on GPUs and TPUs;
the same layer can be repeated dozens or hundreds of times;
the architecture does not assume language specifically, only sequences of tokens;
scaling data, model size, and context length tends to improve performance smoothly.

That combination made Transformers less like a one-off NLP model and more like a general-purpose interface between data and computation.

Figure 4: This comparison table captures why the design scaled so well in practice. Self-attention keeps the path length between any two tokens at O(1), and unlike recurrent layers it avoids sequential dependence during the main computation. That combination is what made long-range reasoning easier and GPU training far more efficient. Source: [1].
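To get a feel for the numbers behind that table, the sketch below plugs a sequence length n and model width d into the asymptotic per-layer figures reported in [1]: O(n²·d) operations for self-attention and O(n·d²) for a recurrent layer. Constants are dropped, so treat the output as orders of magnitude rather than real FLOP counts. The function name is hypothetical.

```python
def per_layer_costs(n, d):
    """Asymptotic per-layer figures (constants dropped) for sequence length n and width d."""
    return {
        "self_attention": {"ops": n * n * d, "sequential_steps": 1, "max_path_length": 1},
        "recurrent": {"ops": n * d * d, "sequential_steps": n, "max_path_length": n},
    }

for n in (512, 4096):
    costs = per_layer_costs(n=n, d=512)
    print(n, costs["self_attention"]["ops"], costs["recurrent"]["ops"])
```

The trade-off is visible in the output: self-attention needs no sequential steps, but its operation count grows quadratically with sequence length and overtakes the recurrent layer once n exceeds d.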

What This Overview Should Leave You With

The Transformer is not one trick. It is a clean composition of simple blocks that together solve three hard problems at once:

how to model long-range dependencies;
how to train efficiently at scale;
how to reuse the same backbone across many data types.

✅ Key Takeaways

Transformers replaced sequential RNNs with parallel self-attention.
Each encoder layer has two sub-layers, multi-head attention and a feed-forward network, each wrapped in a residual connection and layer normalisation.
Positional encodings compensate for the order-agnostic nature of attention.
The same architecture works across text, images, audio, and biology by changing inputs and objectives.

References

[1] Vaswani, A. et al. (2017). Attention Is All You Need.
[2] https://www.sscardapane.it/alice-book/
[3] Devlin, J. et al. (2018). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.
[4] Dosovitskiy, A. et al. (2020). An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale.
[5] Zhang, A., Lipton, Z. C., Li, M., and Smola, A. J. Dive into Deep Learning, chapters on attention and Transformers.

Alessio Borgi
