BERT: Bidirectional Transformers for Language Understanding

TL;DR: BERT (Devlin et al., Google, 2018) is an encoder-only Transformer pre-trained on two tasks: Masked Language Modelling (predict masked tokens) and Next Sentence Prediction. Fine-tuning on 11 NLP tasks produced state-of-the-art results across the board.

The Problem with Left-to-Right

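Before BERT, pre-trained language models built representations from one direction at a time: GPT-1 reads strictly left to right, and ELMo only concatenates two separately trained one-directional LSTMs. For a left-to-right model, the word “bank” in “The bank of the river was muddy” and “The bank approved my loan” gets the same representation at that position, because the disambiguating right context hasn’t been seen yet.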

BERT’s answer: see the entire sentence from the start, and predict missing pieces using both left and right context simultaneously.
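The difference between the two reading styles can be pictured as an attention mask. The snippet below is a minimal sketch in plain Python, not BERT's actual implementation; the helper names (`causal_mask`, `bidirectional_mask`, `visible_tokens`) and the example sentence are made up for illustration.

```python
def causal_mask(n):
    """Left-to-right model: position i may attend only to positions j <= i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    """Encoder-style model: every position may attend to every position."""
    return [[1] * n for _ in range(n)]


def visible_tokens(tokens, mask, position):
    """Return the tokens that `position` is allowed to attend to."""
    return [tok for tok, allowed in zip(tokens, mask[position]) if allowed]


tokens = ["The", "bank", "of", "the", "river"]
bank = tokens.index("bank")

# Left-to-right: "bank" cannot see "river" yet.
print(visible_tokens(tokens, causal_mask(len(tokens)), bank))
# Bidirectional: "bank" sees the whole sentence, including "river".
print(visible_tokens(tokens, bidirectional_mask(len(tokens)), bank))
```

With the causal mask, "bank" only has "The" and itself to work with; with the full mask it can use "river" to settle the meaning.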

The Two Pre-Training Tasks

Task 1: Masked Language Modelling (MLM)

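15% of tokens are randomly selected for prediction and, in most cases, replaced with [MASK] (in the paper, 80% of the selected tokens become [MASK], 10% become a random token, and 10% are left unchanged). The model must predict the original token from the surrounding context.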

Input:  "The cat [MASK] on the mat"
Target: predict "sat"

Because the model sees both sides of the mask, it learns truly bidirectional representations.
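The masking step can be sketched in a few lines of plain Python. This is a simplified illustration, not the reference implementation: it selects each token independently with probability 15% and applies the 80/10/10 replacement rule from the BERT paper. The function name `mask_tokens`, the toy vocabulary, and the per-token coin flip are assumptions made for the example.

```python
import random

MASK = "[MASK]"
SPECIAL = {"[CLS]", "[SEP]", "[PAD]"}  # never selected for prediction


def mask_tokens(tokens, vocab, mask_prob=0.15, rng=None):
    """Return (inputs, labels); labels[i] is None where no prediction is made."""
    if rng is None:
        rng = random.Random()
    inputs = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        # Skip special tokens and the ~85% of tokens that are not selected.
        if tok in SPECIAL or rng.random() >= mask_prob:
            continue
        labels[i] = tok  # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = MASK               # 80%: replace with [MASK]
        elif roll < 0.9:
            inputs[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: leave the token unchanged
    return inputs, labels


toy_vocab = ["the", "cat", "sat", "on", "mat", "dog"]
sentence = ["[CLS]", "the", "cat", "sat", "on", "the", "mat", "[SEP]"]
inputs, labels = mask_tokens(sentence, toy_vocab, rng=random.Random(0))
print(inputs)
print(labels)
```

The loss is computed only at positions where `labels` is not `None`; everything else serves as context.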

Task 2: Next Sentence Prediction (NSP)

Given two sentences A and B, predict whether B actually follows A in the original text. In 50% of training pairs B is the true next sentence; in the other 50% it is a random sentence from the corpus.

This is meant to teach the model relationships between sentences, which is useful for QA, entailment, and summarisation tasks. (Later work, notably RoBERTa, found NSP to be less important than MLM; many models since drop it.)
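Building NSP training pairs is mostly bookkeeping. The sketch below assumes the corpus is a list of documents, each a list of sentences, and draws the negative sentence from a different document; the function name `make_nsp_examples` and the toy corpus are hypothetical.

```python
import random


def make_nsp_examples(documents, rng=None):
    """Build (sentence_a, sentence_b, is_next) triples from a list of documents.

    Each document is a list of sentences. Needs at least two documents so that
    a negative sentence can be drawn from somewhere else.
    """
    if len(documents) < 2:
        raise ValueError("need at least two documents to sample negatives")
    if rng is None:
        rng = random.Random()
    examples = []
    for doc_idx, doc in enumerate(documents):
        for i in range(len(doc) - 1):
            sent_a = doc[i]
            if rng.random() < 0.5:
                # Positive pair: B really follows A.
                sent_b, is_next = doc[i + 1], True
            else:
                # Negative pair: B is a random sentence from another document.
                others = [k for k in range(len(documents)) if k != doc_idx]
                sent_b = rng.choice(documents[rng.choice(others)])
                is_next = False
            examples.append((sent_a, sent_b, is_next))
    return examples


corpus = [
    ["The cat sat on the mat.", "Then it fell asleep."],
    ["Interest rates rose again.", "Banks passed the increase on."],
]
for a, b, is_next in make_nsp_examples(corpus, rng=random.Random(0)):
    print(is_next, "|", a, "->", b)
```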

[Figure: BERT pre-training with Masked Language Modelling. The input sequence "[CLS] The cat [MASK] on the mat" passes through the BERT encoder stack (N × Transformer layers) with bidirectional self-attention, where every token sees every other. The [MASK] position predicts "sat"; the [CLS] representation feeds NSP / classification.]
Figure 1: BERT sees the whole sentence bidirectionally. The [MASK] token must be predicted using both left and right context. The [CLS] token aggregates sequence-level information for classification tasks.

Fine-Tuning for Downstream Tasks

After pre-training, BERT is fine-tuned by adding a task-specific head on top:

| Task | Head | Input |
| --- | --- | --- |
| Classification | Linear on [CLS] | Single sequence |
| NER | Linear on each token | Single sequence |
| QA | Two linear layers (start/end span) | Question + passage |
| Entailment | Linear on [CLS] | Sentence A + [SEP] + Sentence B |
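The "Input" column boils down to one packing convention: a leading [CLS], a [SEP] after each segment, and a segment ID telling the model which sentence a token belongs to. A minimal sketch, assuming the text is already tokenised (the function name `pack_inputs` is made up for this example):

```python
def pack_inputs(tokens_a, tokens_b=None):
    """Pack one or two token lists into BERT's input format.

    Returns (tokens, segment_ids). Segment 0 covers [CLS], sentence A and its
    [SEP]; segment 1 covers sentence B and the final [SEP].
    """
    tokens = ["[CLS]"] + list(tokens_a) + ["[SEP]"]
    segment_ids = [0] * len(tokens)
    if tokens_b is not None:
        tokens += list(tokens_b) + ["[SEP]"]
        segment_ids += [1] * (len(tokens_b) + 1)
    return tokens, segment_ids


# Single sequence (classification, NER)
print(pack_inputs(["great", "movie"]))
# Sentence pair (entailment, QA)
print(pack_inputs(["is", "it", "raining"], ["the", "street", "is", "wet"]))
```

A classification head then reads only the output vector at position 0 ([CLS]), while a token-level head such as NER reads every position.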

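Fine-tuning is cheap compared with pre-training, because the pre-trained weights already encode rich language understanding. The BERT paper reports that each of its results can be reproduced in at most an hour on a single Cloud TPU, or a few hours on a GPU, starting from the same pre-trained model.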
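For the QA head, the model emits one start score and one end score per token, and the answer is the span that maximises their sum subject to the end not coming before the start. The sketch below shows that selection step in plain Python; the function name `best_span`, the `max_len` cap, and the example scores are assumptions for illustration.

```python
def best_span(start_scores, end_scores, max_len=30):
    """Return ((start, end), score) for the highest-scoring valid span.

    A span is valid when start <= end and it covers at most `max_len` tokens.
    """
    best_pair, best_score = None, float("-inf")
    for i, s in enumerate(start_scores):
        last = min(i + max_len, len(end_scores))
        for j in range(i, last):
            score = s + end_scores[j]
            if score > best_score:
                best_pair, best_score = (i, j), score
    return best_pair, best_score


# Made-up scores for a four-token passage.
start = [0.1, 2.0, 0.3, 0.0]
end = [0.0, 0.5, 3.0, 0.2]
print(best_span(start, end))  # ((1, 2), 5.0)
```

Taking the argmax of each score list independently can produce an end position before the start; searching over pairs avoids that.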

BERT Family

| Model | Key difference |
| --- | --- |
| BERT-base | 12 layers, 110M params |
| BERT-large | 24 layers, 340M params |
| RoBERTa | More data, longer training, no NSP |
| ALBERT | Smaller via parameter sharing and factored embeddings |
| DistilBERT | 40% smaller, 97% of BERT performance via distillation |
| DeBERTa | Disentangled attention (content vs. position separate) |
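The parameter counts for the two original models can be roughly reproduced with back-of-the-envelope arithmetic. The sketch below assumes details that are not stated above: hidden sizes of 768 and 1024, a 30,522-token WordPiece vocabulary, 512 positions, two segment types, a feed-forward layer four times the hidden size, and a pooler layer on [CLS], as in the released English uncased checkpoints. The function name `bert_param_count` is made up.

```python
def bert_param_count(layers, hidden, vocab=30522, max_pos=512,
                     type_vocab=2, ffn_mult=4):
    """Approximate parameter count of a BERT encoder (weights and biases)."""
    layer_norm = 2 * hidden  # scale + shift

    # Token, position and segment embeddings, followed by a LayerNorm.
    embeddings = (vocab + max_pos + type_vocab) * hidden + layer_norm

    # Self-attention: query, key, value and output projections.
    attention = 4 * (hidden * hidden + hidden) + layer_norm

    # Feed-forward: hidden -> ffn -> hidden.
    ffn = hidden * ffn_mult
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden) + layer_norm

    # Dense layer applied to the [CLS] vector.
    pooler = hidden * hidden + hidden

    return embeddings + layers * (attention + feed_forward) + pooler


print(f"BERT-base:  {bert_param_count(12, 768):,}")
print(f"BERT-large: {bert_param_count(24, 1024):,}")
```

Under these assumptions the totals land close to the 110M and 340M figures in the table; the encoder layers account for most of the weights, with the token embeddings a distant second.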

✅ Key Takeaways

  • BERT is an encoder-only Transformer: it builds rich representations but doesn't generate text.
  • Pre-trained via Masked LM (15% random masks) using both left and right context simultaneously.
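  • Fine-tuning is cheap: add a small task head on top of the pre-trained encoder and train the whole model end-to-end for a few epochs (freezing the encoder is an optional, feature-based alternative).
  • Popularised the "pre-train then fine-tune" paradigm that dominates NLP to this day.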