CLIP: Connecting Images and Text with Contrastive Learning

Published: March 17, 2024

TL;DR: CLIP (Radford et al., OpenAI, 2021) trains two encoders, one for images and one for text, using a contrastive objective on 400M web-scraped image-text pairs. Matched pairs should have similar embeddings; unmatched pairs should be dissimilar. The result is a shared vision-language space that enables zero-shot classification and image-text retrieval, and whose encoders are reused in DALL-E 2, Stable Diffusion, and LLaVA.

The Core Insight: Supervision from Captions

Traditional vision models are trained on human-annotated labels: 1000 ImageNet classes, 80 COCO categories. This is expensive and limits what the model can understand to pre-defined categories.

Images on the internet come with natural language descriptions — alt text, captions, hashtags. CLIP’s insight: use text as the supervision signal. If a model can match images to their descriptions in a shared embedding space, it has learned rich visual semantics — and the supervision is free, at scale.

The Architecture: Two Encoders, One Space

CLIP jointly trains two separate encoders:

Vision encoder: a ResNet or ViT that maps an image → embedding vector e_I ∈ ℝ^d
Text encoder: a Transformer that maps a caption → embedding vector e_T ∈ ℝ^d

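The output of each encoder is mapped into the same d-dimensional space by its own learned linear projection. The resulting embeddings are L2-normalised, so the dot product between two embeddings is their cosine similarity.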

The goal of training: for matched (image, caption) pairs, e_I · e_T should be large. For mismatched pairs, it should be small.
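The projection and normalisation steps can be sketched in a few lines. This is a dependency-free illustration using plain Python lists, not CLIP's actual code: the feature vectors, the matrices `W_image` and `W_text`, and the tiny dimensions are all made-up values chosen so the arithmetic is easy to follow.

```python
import math

def project(features, weights):
    """Linear projection: multiply a feature vector by a (d_in x d) weight matrix."""
    d_out = len(weights[0])
    return [sum(f * row[k] for f, row in zip(features, weights)) for k in range(d_out)]

def l2_normalise(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Hypothetical encoder outputs with different widths (3 for image, 2 for text).
image_features = [1.0, 2.0, 0.0]
text_features = [3.0, 1.0]

# Hypothetical projection matrices mapping both into a shared d = 2 space.
W_image = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 x 2
W_text = [[0.5, 0.0], [0.0, 2.0]]               # 2 x 2

e_I = l2_normalise(project(image_features, W_image))
e_T = l2_normalise(project(text_features, W_text))
similarity = dot(e_I, e_T)  # cosine similarity, always in [-1, 1]
```

Because both embeddings have unit length, `similarity` depends only on the angle between them, which is what the contrastive objective pushes up for matched pairs and down for mismatched ones.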

The Contrastive Objective (InfoNCE)

Given a batch of N image-text pairs, CLIP computes an N×N matrix of similarity scores:

S[i, j] = e_Iᵢ · e_Tⱼ / τ

Where τ is a learned temperature parameter. The correct pairs are on the diagonal (i=j). The loss encourages the diagonal to be the maximum in each row and column:

L = −(1/(2N)) Σᵢ [ log( exp(S[i,i]) / Σⱼ exp(S[i,j]) ) + log( exp(S[i,i]) / Σⱼ exp(S[j,i]) ) ]

This is a symmetric cross-entropy: each image should retrieve its caption (row-wise), and each caption should retrieve its image (column-wise). CLIP used a batch size of N = 32,768, so each correct pair competes against 32,767 negatives in each direction.
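A minimal sketch of this loss, written with plain Python lists so it runs without any tensor library. The function names and the toy embeddings are illustrative assumptions. The two directions are averaged here, as in the paper's pseudocode; summing them instead would only rescale the loss by a constant.

```python
import math

def log_softmax(row):
    """Numerically stable log-softmax of a list of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]

def clip_loss(image_embs, text_embs, temperature):
    """Symmetric cross-entropy over the N x N similarity matrix.

    image_embs and text_embs are lists of N unit-length vectors; index i
    in both lists belongs to the same (image, caption) pair.
    """
    n = len(image_embs)
    # S[i][j] = e_I_i . e_T_j / tau
    S = [[sum(a * b for a, b in zip(img, txt)) / temperature for txt in text_embs]
         for img in image_embs]
    # Image -> text: softmax over each row, target is the diagonal entry.
    loss_i2t = -sum(log_softmax(S[i])[i] for i in range(n)) / n
    # Text -> image: softmax over each column, target is the diagonal entry.
    columns = [list(col) for col in zip(*S)]
    loss_t2i = -sum(log_softmax(columns[i])[i] for i in range(n)) / n
    return (loss_i2t + loss_t2i) / 2

# Toy batch of N = 2 pairs with made-up unit vectors.
images = [[1.0, 0.0], [0.0, 1.0]]
aligned = clip_loss(images, [[1.0, 0.0], [0.0, 1.0]], temperature=0.1)
swapped = clip_loss(images, [[0.0, 1.0], [1.0, 0.0]], temperature=0.1)
```

With perfectly aligned pairs the loss is close to zero, while swapping the captions makes every diagonal entry the worst score in its row and column, so `swapped` is much larger than `aligned`.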

Why contrastive and not generative? A generative objective (predicting the caption word by word) is computationally expensive and makes the model learn the exact wording of each caption rather than just vision-language alignment. Contrastive learning optimises the embedding alignment directly, which the CLIP authors found to be markedly more compute-efficient.

Training Data: WebImageText (WIT)

CLIP was trained on 400 million (image, text) pairs scraped from the internet. The captions are noisy — many are not clean descriptions but are alt-text, file names, or tangentially related sentences. CLIP learns to be robust to this noise through scale.

No manual annotation was used: text that already accompanied the images on the web took the place of labelling 400M images by hand.

Zero-Shot Classification: A Key Property

After training, CLIP enables zero-shot classification without any fine-tuning:

Encode all class names as text: “a photo of a cat”, “a photo of a dog”, …
Encode the query image
Find the closest text embedding (cosine similarity)
The closest class is the prediction
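The four steps above can be sketched as follows. The embeddings are made-up 3-dimensional vectors standing in for real encoder outputs, and `zero_shot_classify` is a hypothetical helper name, not part of any CLIP library.

```python
import math

def l2_normalise(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def zero_shot_classify(image_emb, class_names, text_embs):
    """Return the class whose prompt embedding has the highest cosine similarity."""
    image_emb = l2_normalise(image_emb)
    scores = {}
    for name, text_emb in zip(class_names, text_embs):
        text_emb = l2_normalise(text_emb)
        scores[name] = sum(a * b for a, b in zip(image_emb, text_emb))
    best = max(scores, key=scores.get)
    return best, scores

class_names = ["cat", "dog", "car"]
# In a real pipeline these strings would be fed to the text encoder.
prompts = [f"a photo of a {name}" for name in class_names]

# Made-up embeddings, one per prompt, in place of text encoder outputs.
text_embs = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]
image_emb = [0.2, 0.8, 0.1]  # made-up embedding of the query image

prediction, scores = zero_shot_classify(image_emb, class_names, text_embs)
```

The text embeddings depend only on the class list, so they can be computed once and reused for every query image.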

On ImageNet-1k (1000 classes), CLIP ViT-L/14 achieves 75.3% zero-shot top-1 accuracy, roughly on par with a ResNet-50 trained with full ImageNet supervision. This was striking: without task-specific training or any ImageNet labels, CLIP was competitive with a fully supervised baseline.

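More importantly, CLIP is far more robust than supervised ImageNet models under natural distribution shift, for example on sketches and artistic renditions of the same classes. It does not generalise everywhere, though: the original paper reports weak zero-shot results on specialised or abstract tasks such as satellite imagery, medical imaging, and even MNIST.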

CLIP in Modern AI Systems

CLIP’s shared vision-language space has become ubiquitous:

DALL-E 2 (OpenAI): uses CLIP image embeddings as the conditioning signal for diffusion-based image generation. Text → CLIP text embedding → prior → CLIP image embedding → diffusion decoder → image. The overall approach is called unCLIP because the decoder inverts the CLIP image encoder.

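Stable Diffusion: uses a frozen CLIP text encoder to condition the UNet denoiser via cross-attention. The conditioning is the sequence of per-token hidden states from the text encoder rather than a single pooled embedding, and the same sequence is attended to at every denoising step.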

LLaVA / LLaVA-Next: uses a CLIP ViT as the vision encoder to extract patch features, which are mapped (via a linear projection or MLP) into the token space of an LLM.

OpenCLIP: an open reproduction of CLIP, with models trained on LAION datasets such as LAION-400M and LAION-2B (the English subset of LAION-5B). SigLIP (Google) further improves CLIP training by replacing the batch-wise softmax loss with a pairwise sigmoid loss.
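To make the softmax versus sigmoid distinction concrete, here is a minimal sketch of a pairwise sigmoid loss in the style of SigLIP. It treats every (image, text) cell of the similarity matrix as an independent binary decision, with label +1 on the diagonal and −1 elsewhere. The `scale` and `bias` values are arbitrary assumptions for the toy example (both are learned in SigLIP), and the function names are hypothetical.

```python
import math

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def sigmoid_pair_loss(image_embs, text_embs, scale, bias):
    """Sum of binary log-losses over all N x N pairs, divided by N."""
    n = len(image_embs)
    total = 0.0
    for i, img in enumerate(image_embs):
        for j, txt in enumerate(text_embs):
            logit = scale * sum(a * b for a, b in zip(img, txt)) + bias
            label = 1.0 if i == j else -1.0  # matched pair vs. mismatched pair
            total += log_sigmoid(label * logit)
    return -total / n

# Toy batch of N = 2 pairs with made-up unit vectors.
images = [[1.0, 0.0], [0.0, 1.0]]
aligned = sigmoid_pair_loss(images, [[1.0, 0.0], [0.0, 1.0]], scale=10.0, bias=-5.0)
swapped = sigmoid_pair_loss(images, [[0.0, 1.0], [1.0, 0.0]], scale=10.0, bias=-5.0)
```

Unlike the softmax loss, no term here requires a sum over the whole row or column, so each pair contributes independently of the rest of the batch.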

CLIP vs Supervised ViT

Property	Supervised ViT	CLIP
Training supervision	Manual labels	Natural language captions
Data scale	~1M labelled	400M+ web pairs
Zero-shot transfer	Poor	Strong
Domain generalisation	Limited	Broad
Text-image retrieval	None	Native
ImageNet classification accuracy	Higher (fine-tuned)	Competitive (zero-shot)

Limitations

CLIP struggles with fine-grained counting (“three dogs”) and spatial reasoning (“the dog to the left of the cat”)
Text descriptions in training are biased toward simple captions; complex compositional descriptions are harder
CLIP embeddings can be sensitive to prompt phrasing; prompt engineering helps (“a photo of a {class}”, “a {class} in the wild”, …)
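One common way to reduce sensitivity to prompt phrasing is to ensemble several templates per class: embed each prompt, average the normalised embeddings, and renormalise. The sketch below assumes that approach. The two embeddings are made-up stand-ins for text encoder outputs, and `ensemble_class_embedding` is a hypothetical helper name.

```python
import math

def l2_normalise(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def ensemble_class_embedding(template_embs):
    """Average the unit-normalised embeddings of several prompts, then renormalise."""
    normed = [l2_normalise(e) for e in template_embs]
    d = len(normed[0])
    mean = [sum(e[k] for e in normed) / len(normed) for k in range(d)]
    return l2_normalise(mean)

templates = ["a photo of a {}", "a {} in the wild"]
# In a real pipeline each of these strings would go through the text encoder.
prompts = [t.format("dog") for t in templates]

# Made-up 2-d embeddings, one per prompt.
template_embs = [[1.0, 0.0], [0.0, 1.0]]
class_emb = ensemble_class_embedding(template_embs)
```

The resulting `class_emb` is used in place of a single prompt embedding during zero-shot classification, at no extra cost per query image.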

Summary

Property	Value
Architecture	Vision encoder + Text encoder (separate)
Objective	Contrastive (InfoNCE), batch size N = 32,768
Training data	400M web image-text pairs (no annotation)
Key capability	Zero-shot classification, image-text retrieval
Zero-shot ImageNet	75.3% (ViT-L/14)
Downstream use	Stable Diffusion, DALL-E 2, LLaVA

CLIP showed that natural-language supervision at web scale can rival label supervision for vision pre-training, and transfers to new tasks far more readily. It underpins much of modern vision-language AI.

Alessio Borgi
