BERT: Bidirectional Transformers for Language Understanding

Published: January 10, 2024

TL;DR: BERT (Devlin et al., Google, 2018) is an encoder-only Transformer pre-trained on two tasks: Masked Language Modelling (predict masked tokens) and Next Sentence Prediction. Fine-tuned on 11 NLP tasks, it set a new state of the art on each of them at the time of publication.

The Problem with Left-to-Right

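Language models before BERT did not condition on both sides of a word at once. GPT-1 reads text in one direction, left to right. ELMo trains a left-to-right and a right-to-left LSTM separately and only concatenates their outputs, so neither half ever sees the full context. In a left-to-right model, the word “bank” in “The bank raised its rates” and “The bank of the river flooded” gets the same representation, because the disambiguating words lie to the right and haven’t been seen yet.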

BERT’s answer: see the entire sentence from the start, and predict missing pieces using both left and right context simultaneously.
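The difference between the two reading modes can be made concrete with a visibility mask: a matrix that says which positions each token is allowed to look at. The sketch below is a minimal illustration in plain Python, not BERT's actual attention code; the function name `visibility_mask` and the example sentence are assumptions made for the demonstration.

```python
def visibility_mask(n_tokens, bidirectional):
    """Return an n x n matrix; entry [i][j] is 1 if position i may look at position j."""
    return [
        [1 if (bidirectional or j <= i) else 0 for j in range(n_tokens)]
        for i in range(n_tokens)
    ]


tokens = ["The", "bank", "of", "the", "river", "flooded"]
causal = visibility_mask(len(tokens), bidirectional=False)  # left-to-right model
full = visibility_mask(len(tokens), bidirectional=True)     # BERT-style encoder

# Which words can the representation of "bank" draw on?
i = tokens.index("bank")
visible_causal = [t for t, m in zip(tokens, causal[i]) if m]
visible_full = [t for t, m in zip(tokens, full[i]) if m]

print(visible_causal)  # only the words up to and including "bank"
print(visible_full)    # the whole sentence, including "river"
```

With the left-to-right mask, "bank" never sees "river"; with the full mask it does, which is the property BERT's pre-training relies on.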

The Two Pre-Training Tasks

Task 1: Masked Language Modelling (MLM)

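15% of tokens are randomly selected for prediction. Of those, 80% are replaced with [MASK], 10% are replaced with a random token, and 10% are left unchanged, so the model cannot rely on [MASK] always marking the positions it is scored on. The model must predict the original token from the surrounding context.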

Input:  "The cat [MASK] on the mat"
Target: predict "sat"

Because the model sees both sides of the mask, it learns truly bidirectional representations.
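A minimal sketch of how MLM training inputs could be built is shown below. It follows the corruption scheme described in the BERT paper (of the selected tokens, 80% become [MASK], 10% become a random token, 10% stay unchanged), but the helper name `mask_tokens`, the toy vocabulary, and the per-token coin flip used to approximate "15% of tokens" are assumptions of this example. It also ignores details such as never masking special tokens.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (inputs, labels); labels[i] is None where no prediction is required."""
    rng = random.Random(seed)
    inputs = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue  # position not selected, left untouched and not scored
        labels[i] = tok  # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = MASK               # 80%: replace with [MASK]
        elif roll < 0.9:
            inputs[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token in the input
    return inputs, labels


vocab = ["the", "cat", "sat", "on", "mat", "dog"]
tokens = ["the", "cat", "sat", "on", "the", "mat"]
inputs, labels = mask_tokens(tokens, vocab, seed=0)
print(inputs)
print(labels)
```

The loss is computed only at positions where `labels` is not `None`; all other positions contribute context but no training signal.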

Task 2: Next Sentence Prediction (NSP)

Given two sentences A and B, predict whether B actually follows A in the original text. In the training data, B is the true next sentence 50% of the time and a random sentence from the corpus the other 50%.

This is intended to teach the model relationships between sentences, which matter for QA, entailment, and summarisation tasks. (Later work, notably RoBERTa, found that NSP matters much less than MLM; many models since drop it.)
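The construction of one NSP training example can be sketched as follows. The packing into `[CLS] A [SEP] B [SEP]` with segment ids 0 and 1 follows the input format described in the BERT paper; the function name `make_nsp_example`, the toy sentences, and the pool of sentences from other documents are assumptions of this example.

```python
import random


def make_nsp_example(sentences, index, other_sentences, rng):
    """Build one NSP example from sentence `index` of a tokenised document.

    Returns (tokens, segment_ids, is_next), where is_next is 1 if sentence B
    really follows sentence A and 0 if B was drawn at random.
    """
    sent_a = sentences[index]
    if rng.random() < 0.5:
        sent_b, is_next = sentences[index + 1], 1       # true next sentence
    else:
        sent_b, is_next = rng.choice(other_sentences), 0  # random sentence
    tokens = ["[CLS]"] + sent_a + ["[SEP]"] + sent_b + ["[SEP]"]
    # Segment 0 covers [CLS], sentence A and the first [SEP];
    # segment 1 covers sentence B and the final [SEP].
    segment_ids = [0] * (len(sent_a) + 2) + [1] * (len(sent_b) + 1)
    return tokens, segment_ids, is_next


document = [["the", "cat", "sat"], ["it", "purred"]]
other_sentences = [["stocks", "fell"]]  # hypothetical sentences from other documents
tokens, segment_ids, is_next = make_nsp_example(
    document, 0, other_sentences, random.Random(0)
)
print(tokens, segment_ids, is_next)
```

During pre-training, the final hidden state of [CLS] is fed to a binary classifier that predicts `is_next`.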

Figure 1: BERT sees the whole sentence bidirectionally. The [MASK] token must be predicted using both left and right context. The [CLS] token aggregates sequence-level information for classification tasks.

Fine-Tuning for Downstream Tasks

After pre-training, BERT is fine-tuned by adding a task-specific head on top:

Task	Head	Input
Classification	Linear on [CLS]	Single sequence
NER	Linear on each token	Single sequence
QA	Linear on each token, two outputs (span start/end)	Question + passage
Entailment	Linear on [CLS]	Sentence A + [SEP] + Sentence B

Fine-tuning typically takes minutes to a few hours on a single GPU, depending on dataset size, even for large models, because the pre-trained weights already encode rich language understanding.

BERT Family

Model	Key difference
BERT-base	12 layers, 110M params
BERT-large	24 layers, 340M params
RoBERTa	More data, longer training, no NSP
ALBERT	Smaller via cross-layer parameter sharing and factorised embeddings
DistilBERT	40% smaller than BERT-base, retains about 97% of its performance via distillation
DeBERTa	Disentangled attention (content vs. position separate)
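The parameter counts for BERT-base and BERT-large can be roughly reproduced from the architecture. The sketch below assumes details that are not stated in this post: hidden sizes of 768 and 1024, a feed-forward width of four times the hidden size, a 30,522-token WordPiece vocabulary, 512 positions, 2 segment types, and a pooler layer on [CLS]. The function name `bert_param_count` is likewise just for this example.

```python
def bert_param_count(layers, hidden, vocab=30522, max_pos=512, type_vocab=2):
    """Approximate parameter count of a BERT encoder under the stated assumptions."""
    ffn = 4 * hidden
    layer_norm = 2 * hidden  # scale and bias
    # Token, position and segment embeddings, followed by a LayerNorm.
    embeddings = (vocab + max_pos + type_vocab) * hidden + layer_norm
    # Query, key, value and output projections (weights + biases), plus LayerNorm.
    attention = 4 * (hidden * hidden + hidden) + layer_norm
    # Two feed-forward projections (weights + biases), plus LayerNorm.
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden) + layer_norm
    # Dense layer applied to the [CLS] vector.
    pooler = hidden * hidden + hidden
    return embeddings + layers * (attention + feed_forward) + pooler


base = bert_param_count(layers=12, hidden=768)
large = bert_param_count(layers=24, hidden=1024)
print(f"BERT-base:  {base / 1e6:.1f}M parameters")
print(f"BERT-large: {large / 1e6:.1f}M parameters")
```

Under these assumptions the totals land close to the rounded 110M and 340M figures in the table; the exact number depends on the vocabulary size and on whether task heads are counted.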

✅ Key Takeaways

BERT is an encoder-only Transformer: it builds rich representations but doesn't generate text.
Pre-trained via Masked LM (15% random masks) using both left and right context simultaneously.
Fine-tuning is cheap: add a task head on top of the encoder and train end to end, usually updating all weights (freezing the encoder is an even cheaper alternative).
Popularised the "pre-train then fine-tune" paradigm that dominates NLP to this day.


Alessio Borgi
