BERT: Bidirectional Transformers for Language Understanding
The Problem with Left-to-Right
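Earlier pre-trained models handled context in a restricted way. GPT-1 reads text strictly left to right, while ELMo trains separate left-to-right and right-to-left LSTMs and only concatenates their outputs, so neither direction sees the other while building its representation. In a left-to-right model, the word “bank” in “The bank of the river was muddy” and “The bank approved my loan” gets the same representation, because the disambiguating words to its right have not been seen yet.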
BERT’s answer: see the entire sentence from the start, and predict missing pieces using both left and right context simultaneously.
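The difference between the two approaches can be pictured as an attention mask. The following is a minimal sketch, not BERT's actual implementation: the helper name `attention_mask` and the example sentence are made up for illustration.

```python
def attention_mask(seq_len: int, bidirectional: bool) -> list[list[int]]:
    """Return mask[i][j] = 1 if position i may attend to position j."""
    return [
        [1 if bidirectional or j <= i else 0 for j in range(seq_len)]
        for i in range(seq_len)
    ]


tokens = ["The", "bank", "of", "the", "river"]
causal = attention_mask(len(tokens), bidirectional=False)  # left-to-right model
full = attention_mask(len(tokens), bidirectional=True)     # BERT-style encoder

bank = tokens.index("bank")
visible_causal = [t for t, m in zip(tokens, causal[bank]) if m]
visible_full = [t for t, m in zip(tokens, full[bank]) if m]

print(visible_causal)  # ['The', 'bank']
print(visible_full)    # ['The', 'bank', 'of', 'the', 'river']
```

With the causal mask, "bank" never sees "river"; with the full mask it does, which is the property the pre-training tasks below are designed to exploit.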
The Two Pre-Training Tasks
Task 1: Masked Language Modelling (MLM)
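15% of tokens are selected at random as prediction targets. Of these, 80% are replaced with [MASK], 10% are replaced with a random token, and 10% are left unchanged. The model must predict the original token at every selected position from the surrounding context.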
Input: "The cat [MASK] on the mat"
Target: predict "sat"
Because the model sees both sides of the mask, it learns truly bidirectional representations.
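As a rough illustration, the corruption step might look like the sketch below. It assumes the 80/10/10 split from the original BERT recipe (selected tokens become [MASK], a random token, or stay unchanged) and, as a simplification, samples each position independently instead of picking an exact 15% of the sequence. The function name `mask_tokens` and the toy vocabulary are hypothetical.

```python
import random

MASK = "[MASK]"
SPECIAL = {"[CLS]", "[SEP]", "[PAD]"}  # never selected for prediction


def mask_tokens(tokens, vocab, select_prob=0.15, rng=None):
    """Return (corrupted, labels); labels[i] is None where nothing is predicted."""
    rng = rng if rng is not None else random.Random()
    corrupted = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if tok in SPECIAL or rng.random() >= select_prob:
            continue
        labels[i] = tok  # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK               # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: leave the token unchanged
    return corrupted, labels


tokens = ["[CLS]", "the", "cat", "sat", "on", "the", "mat", "[SEP]"]
vocab = ["the", "cat", "sat", "on", "mat", "dog", "ran"]
corrupted, labels = mask_tokens(tokens, vocab, rng=random.Random(0))
print(corrupted)
print(labels)
```

The loss is computed only at positions where `labels` is not `None`, so the model gets a training signal from roughly 15% of tokens per pass.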
Task 2: Next Sentence Prediction (NSP)
Given two sentences A and B, predict whether B actually follows A in the original text (50% of the time it does, 50% a random sentence is used).
This is intended to teach the model relationships between sentences, which matter for tasks such as question answering and natural language inference. (Later work, notably RoBERTa, found that NSP contributes far less than MLM, and many later models drop it.)
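Building NSP training pairs could be sketched as follows. This is an illustrative simplification under stated assumptions: `make_nsp_pairs` is a made-up helper, it expects at least two documents, each given as a non-empty list of sentences, and it draws negatives from a different document so that a "random" sentence cannot accidentally be the true continuation.

```python
import random


def make_nsp_pairs(documents, rng=None):
    """Build (sentence_a, sentence_b, is_next) examples from lists of sentences."""
    rng = rng if rng is not None else random.Random()
    examples = []
    for doc_idx, doc in enumerate(documents):
        for i in range(len(doc) - 1):
            if rng.random() < 0.5:
                # Positive example: B really follows A in the source text.
                examples.append((doc[i], doc[i + 1], True))
            else:
                # Negative example: B is a random sentence from another document.
                others = [j for j in range(len(documents)) if j != doc_idx]
                other_doc = documents[rng.choice(others)]
                examples.append((doc[i], rng.choice(other_doc), False))
    return examples


docs = [
    ["The cat sat on the mat.", "It fell asleep.", "Nobody woke it."],
    ["Interest rates rose again.", "Banks tightened lending."],
]
for a, b, is_next in make_nsp_pairs(docs, rng=random.Random(0)):
    print(is_next, "|", a, "->", b)
```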
Fine-Tuning for Downstream Tasks
After pre-training, BERT is fine-tuned by adding a task-specific head on top:
| Task | Head | Input |
|---|---|---|
| Classification | Linear on [CLS] | Single sequence |
| NER | Linear on each token | Single sequence |
| QA | Linear on each token, giving start and end logits | Question + passage |
| Entailment | Linear on [CLS] | Sentence A + [SEP] + Sentence B |
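Fine-tuning typically takes from minutes to a few hours on a single GPU for common benchmark datasets, because the pre-trained weights already encode much of the language knowledge the task needs. All weights, not just the new head, are normally updated during this step.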
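The heads in the table above are small compared with the encoder. The sketch below shows their shape in plain Python, assuming the encoder has already produced one hidden vector per token. All function names and the toy numbers are hypothetical; a real setup would use a tensor library and learned weights.

```python
def pack_pair(tokens_a, tokens_b=None):
    """Build BERT-style input tokens and segment ids for one or two sequences."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"]
    segments = [0] * len(tokens)
    if tokens_b is not None:
        tokens += tokens_b + ["[SEP]"]
        segments += [1] * (len(tokens_b) + 1)
    return tokens, segments


def linear(vector, weights, bias):
    """Compute W x + b for one vector, with W given as a list of rows."""
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weights, bias)]


def classify(hidden_states, weights, bias):
    """Sequence-level head: a linear layer on the [CLS] vector (position 0)."""
    return linear(hidden_states[0], weights, bias)


def tag_tokens(hidden_states, weights, bias):
    """Token-level head (e.g. NER): the same linear layer at every position."""
    return [linear(h, weights, bias) for h in hidden_states]


def best_span(start_logits, end_logits, max_len=30):
    """QA decoding: pick (start, end) with the highest summed score, start <= end."""
    best, best_score = (0, 0), float("-inf")
    for s, s_score in enumerate(start_logits):
        for e in range(s, min(s + max_len, len(end_logits))):
            score = s_score + end_logits[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best


tokens, segments = pack_pair(["who", "sat", "?"], ["the", "cat", "sat"])
print(tokens)    # [CLS] who sat ? [SEP] the cat sat [SEP]
print(segments)  # five 0s for the question, four 1s for the passage

# Toy 2-dimensional "hidden states" standing in for encoder output.
hidden = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, bias = [[1.0, -1.0], [0.5, 0.5]], [0.0, 0.1]
print(classify(hidden, weights, bias))
print(tag_tokens(hidden, weights, bias))
print(best_span([0.1, 2.0, 0.3], [0.0, 0.5, 3.0]))  # (1, 2)
```

Because each head is only a linear layer or two, almost all of the model's capacity lives in the pre-trained encoder.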
BERT Family
| Model | Key difference |
|---|---|
| BERT-base | 12 layers, 110M params |
| BERT-large | 24 layers, 340M params |
| RoBERTa | More data, longer training, no NSP |
| ALBERT | Smaller via cross-layer parameter sharing and factorized embeddings |
| DistilBERT | 40% smaller, retains about 97% of BERT's performance via distillation |
| DeBERTa | Disentangled attention (content vs. position separate) |
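The parameter counts for BERT-base and BERT-large can be roughly reproduced with some arithmetic. The sketch below assumes the hyperparameters of the published uncased checkpoints, none of which appear in the table: a 30,522-token vocabulary, 512 positions, 2 segment types, hidden sizes of 768 and 1024, and a feed-forward width of four times the hidden size. The function name `bert_param_count` is made up.

```python
def bert_param_count(layers, hidden, vocab=30522, max_positions=512, type_vocab=2):
    """Count weights in a BERT encoder: embeddings + Transformer layers + pooler."""
    intermediate = 4 * hidden   # feed-forward width (assumed 4x hidden)
    layer_norm = 2 * hidden     # scale and shift vectors

    # Token, position, and segment embedding tables plus one LayerNorm.
    embeddings = (vocab + max_positions + type_vocab) * hidden + layer_norm

    # Query, key, value, and output projections (weights + biases), then LayerNorm.
    attention = 4 * (hidden * hidden + hidden) + layer_norm

    # Two dense layers (hidden -> intermediate -> hidden), then LayerNorm.
    feed_forward = (
        (hidden * intermediate + intermediate)
        + (intermediate * hidden + hidden)
        + layer_norm
    )

    # Dense layer applied to the [CLS] vector.
    pooler = hidden * hidden + hidden

    return embeddings + layers * (attention + feed_forward) + pooler


print(f"{bert_param_count(12, 768):,}")   # 109,482,240
print(f"{bert_param_count(24, 1024):,}")  # 335,141,888
```

Under these assumptions the totals come out at about 110M and 335M, in line with the rounded 110M and 340M figures in the table. Note that the embedding tables alone account for roughly a fifth of BERT-base.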
✅ Key Takeaways
- BERT is an encoder-only Transformer: it builds rich representations but doesn't generate text.
- Pre-trained with Masked LM (15% of tokens chosen as prediction targets) plus Next Sentence Prediction, using both left and right context simultaneously.
- Fine-tuning is cheap: add a task head and update all weights end to end with a small learning rate (freezing the encoder and training only the head is the feature-based alternative).
- Popularised the "pre-train then fine-tune" paradigm that dominates NLP to this day.
