CLIP: Connecting Images and Text with Contrastive Learning
The Core Insight: Supervision from Captions
Traditional vision models are trained on human-annotated labels: 1000 ImageNet classes, 80 COCO categories. This is expensive and limits what the model can understand to pre-defined categories.
Images on the internet come with natural language descriptions — alt text, captions, hashtags. CLIP’s insight: use text as the supervision signal. If a model can match images to their descriptions in a shared embedding space, it has learned rich visual semantics — and the supervision is free, at scale.
The Architecture: Two Encoders, One Space
CLIP jointly trains two separate encoders:
- Vision encoder: a ResNet or ViT that maps an image → embedding vector e_I ∈ ℝ^d
- Text encoder: a Transformer that maps a caption → embedding vector e_T ∈ ℝ^d
Both are projected to the same d-dimensional embedding space via learned linear projections. Embeddings are L2-normalised.
The goal of training: for matched (image, caption) pairs, e_I · e_T should be large. For mismatched pairs, it should be small.
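
The projection and normalisation steps above can be sketched in NumPy. This is a minimal illustration rather than CLIP's actual code: the encoder outputs are random stand-ins, and the sizes (`d_img`, `d_txt`, `d`) and the projection matrices `W_img` and `W_txt` are assumptions chosen only to make the example run.

```python
import numpy as np

def l2_normalise(x, axis=-1, eps=1e-12):
    """Scale each vector along `axis` to unit L2 norm."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)

rng = np.random.default_rng(0)
n, d_img, d_txt, d = 4, 768, 512, 256  # illustrative sizes, not CLIP's real ones

# Random stand-ins for the outputs of the vision and text encoders.
image_features = rng.normal(size=(n, d_img))
text_features = rng.normal(size=(n, d_txt))

# "Learned" linear projections into the shared space (random here).
W_img = rng.normal(size=(d_img, d)) / np.sqrt(d_img)
W_txt = rng.normal(size=(d_txt, d)) / np.sqrt(d_txt)

e_I = l2_normalise(image_features @ W_img)  # (n, d) unit vectors
e_T = l2_normalise(text_features @ W_txt)   # (n, d) unit vectors

# Because both sides are unit-norm, the dot product is the cosine similarity.
similarity = e_I @ e_T.T  # (n, n); entry [i, j] compares image i with caption j
```

After normalisation every similarity lies in [-1, 1], which is why a temperature is needed to spread the scores before a softmax.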
The Contrastive Objective (InfoNCE)
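Given a batch of N image-text pairs, CLIP computes an N×N matrix of similarity scores:
$$S_{ij} = \frac{e_I^{(i)} \cdot e_T^{(j)}}{\tau}$$
where τ is a learned temperature parameter, and e_I^{(i)} and e_T^{(j)} are the normalised embeddings of the i-th image and the j-th caption. The correct pairs are on the diagonal (i = j). The loss encourages the diagonal to be the maximum in each row and column:
$$\mathcal{L} = -\frac{1}{2N} \sum_{i=1}^{N} \left[ \log \frac{\exp(S_{ii})}{\sum_{j=1}^{N} \exp(S_{ij})} + \log \frac{\exp(S_{ii})}{\sum_{j=1}^{N} \exp(S_{ji})} \right]$$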
This is a symmetric cross-entropy: each image should retrieve its caption (row-wise), and each caption should retrieve its image (column-wise). With a batch size of N = 32,768, each image or caption has 32,767 in-batch negatives, so the model must find the correct match among tens of thousands of distractors.
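
The symmetric cross-entropy described above can be written in a few lines of NumPy. This is a sketch under stated assumptions: the function name `clip_loss` is illustrative, the temperature is a fixed argument here rather than a learned parameter, and the default `tau=0.07` is only an example value.

```python
import numpy as np

def log_softmax(x, axis):
    """Numerically stable log-softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def clip_loss(e_I, e_T, tau=0.07):
    """Symmetric InfoNCE loss for L2-normalised embeddings of shape (N, d).

    Row i of e_I and row i of e_T are assumed to be a matched pair.
    """
    logits = (e_I @ e_T.T) / tau                  # (N, N) similarity scores
    idx = np.arange(logits.shape[0])              # diagonal = correct pairs
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()  # image -> caption
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()  # caption -> image
    return 0.5 * (loss_i2t + loss_t2i)

# Toy check: aligned pairs should score a lower loss than misaligned ones.
rng = np.random.default_rng(0)
e = rng.normal(size=(8, 16))
e = e / np.linalg.norm(e, axis=1, keepdims=True)
aligned_loss = clip_loss(e, e)
shuffled_loss = clip_loss(e, np.roll(e, 1, axis=0))
```

If every score in the matrix is identical, the loss equals log N, which is the value an untrained model with no preference among candidates would produce.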
Training Data: WebImageText (WIT)
CLIP was trained on 400 million (image, text) pairs scraped from the internet. The captions are noisy — many are not clean descriptions but are alt-text, file names, or tangentially related sentences. CLIP learns to be robust to this noise through scale.
No manual annotation was used. The cost of hand-labelling 400M images was avoided by relying on text that already accompanies images on the web.
Zero-Shot Classification: A Key Property
After training, CLIP enables zero-shot classification without any fine-tuning:
- Encode all class names as text: “a photo of a cat”, “a photo of a dog”, …
- Encode the query image
- Find the closest text embedding (cosine similarity)
- The closest class is the prediction
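
The four steps above can be sketched as follows. The embeddings here are hand-made toy vectors standing in for real encoder outputs, and `zero_shot_classify` is a hypothetical helper name, so this shows only the nearest-text-embedding logic, not a real CLIP call.

```python
import numpy as np

def l2_normalise(x):
    """Scale vectors along the last axis to unit length."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def zero_shot_classify(image_embedding, class_text_embeddings, class_names):
    """Return the class whose text embedding has the highest cosine similarity."""
    image_embedding = l2_normalise(image_embedding)
    class_text_embeddings = l2_normalise(class_text_embeddings)
    scores = class_text_embeddings @ image_embedding  # one score per class
    return class_names[int(np.argmax(scores))], scores

class_names = ["cat", "dog", "car"]
prompts = [f"a photo of a {name}" for name in class_names]

# Toy vectors standing in for text_encoder(prompts) and image_encoder(image).
text_embeddings = np.array([
    [1.0, 0.1, 0.0],   # "a photo of a cat"
    [0.1, 1.0, 0.0],   # "a photo of a dog"
    [0.0, 0.0, 1.0],   # "a photo of a car"
])
image_embedding = np.array([0.2, 0.9, 0.1])

prediction, scores = zero_shot_classify(image_embedding, text_embeddings, class_names)
print(prediction)  # dog
```

Changing the set of classes only requires encoding a new list of prompts; no weights are updated.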
On ImageNet-1k (1000 classes), CLIP ViT-L/14 achieves 75.3% zero-shot top-1 accuracy, roughly on par with a ResNet-50 trained with full ImageNet supervision. This was striking: no task-specific training and no ImageNet labels, yet competitive performance.
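More importantly, CLIP is far more robust to natural distribution shift (for example sketches and artistic renditions of ImageNet classes) than supervised models trained on ImageNet. The gains are not universal: the original paper reports that zero-shot CLIP still underperforms on specialised or abstract tasks such as satellite image classification, medical imaging, and even MNIST digit recognition.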
CLIP in Modern AI Systems
CLIP’s shared vision-language space has become ubiquitous:
DALL-E 2 (OpenAI): uses CLIP image embeddings as the conditioning signal for diffusion-based image generation. Text → CLIP text embedding → (unCLIP) CLIP image embedding → diffusion decoder → image.
Stable Diffusion: uses a CLIP text encoder to condition the UNet denoiser via cross-attention. The conditioning signal is the sequence of per-token hidden states from the text encoder, which the UNet attends to at every denoising step.
LLaVA / LLaVA-Next: uses a CLIP ViT as the vision encoder to extract patch features, which are mapped (via a linear projection or MLP) into the token space of an LLM.
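
The LLaVA-style mapping from patch features into an LLM's token space can be sketched as below. All sizes and weight matrices are illustrative assumptions (random values, small dimensions), and the variable names are hypothetical; the point is only the shape bookkeeping for the linear and MLP variants.

```python
import numpy as np

rng = np.random.default_rng(0)
num_patches, d_vision, d_llm = 16, 32, 64  # illustrative sizes only

# Stand-in for per-patch features from a CLIP ViT.
patch_features = rng.normal(size=(num_patches, d_vision))

# Variant 1: a single linear projection.
W = rng.normal(size=(d_vision, d_llm)) / np.sqrt(d_vision)
visual_tokens_linear = patch_features @ W  # (num_patches, d_llm)

# Variant 2: a two-layer MLP with a GELU non-linearity (tanh approximation).
def gelu(x):
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

W1 = rng.normal(size=(d_vision, d_llm)) / np.sqrt(d_vision)
W2 = rng.normal(size=(d_llm, d_llm)) / np.sqrt(d_llm)
visual_tokens_mlp = gelu(patch_features @ W1) @ W2  # (num_patches, d_llm)

# The projected patches are placed alongside ordinary text token embeddings.
text_token_embeddings = rng.normal(size=(5, d_llm))
llm_input = np.concatenate([visual_tokens_mlp, text_token_embeddings], axis=0)
```

Either way, the LLM sees the image as extra tokens of the same width as its word embeddings.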
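OpenCLIP: an open reproduction of CLIP, with models trained on the LAION datasets (for example LAION-400M and LAION-2B, the English subset of the roughly 5-billion-pair LAION-5B). SigLIP (Google) further improves CLIP training by replacing the batch-wide softmax loss with a pairwise sigmoid loss.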
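
To make the softmax versus sigmoid distinction concrete, here is a minimal sketch of a pairwise sigmoid loss in the style of SigLIP. Each image-text pair is treated as an independent binary decision (match or no match) instead of competing in a softmax over the batch. The function name and the default scale `t` and bias `b` are illustrative assumptions; in practice both would be learned.

```python
import numpy as np

def sigmoid_pairwise_loss(e_I, e_T, t=10.0, b=-10.0):
    """Pairwise sigmoid loss over all N*N image-text pairs.

    e_I, e_T: L2-normalised embeddings of shape (N, d); row i of each is a match.
    """
    n = e_I.shape[0]
    logits = t * (e_I @ e_T.T) + b        # (N, N) scaled, shifted similarities
    labels = 2.0 * np.eye(n) - 1.0        # +1 on the diagonal, -1 elsewhere
    # log(sigmoid(x)) computed stably as -log(1 + exp(-x)).
    log_sigmoid = -np.logaddexp(0.0, -labels * logits)
    return -log_sigmoid.sum() / n

# Toy check: matched pairs should give a lower loss than mismatched ones.
e = np.eye(4)
aligned_loss = sigmoid_pairwise_loss(e, e)
shuffled_loss = sigmoid_pairwise_loss(e, np.roll(e, 1, axis=0))
```

Because no normalisation over the whole batch is required, the loss for each pair can be computed independently of the others.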
CLIP vs Supervised ViT
| Property | Supervised ViT | CLIP |
|---|---|---|
| Training supervision | Manual labels | Natural language captions |
| Data scale | ~1M labelled | 400M+ web pairs |
| Zero-shot transfer | Poor | Strong |
| Domain generalisation | Limited | Broad |
| Text-image retrieval | None | Native |
| Classification accuracy (ImageNet) | Higher (fine-tuned) | Competitive (zero-shot) |
Limitations
- CLIP struggles with fine-grained counting (“three dogs”) and spatial reasoning (“the dog to the left of the cat”)
- Text descriptions in training are biased toward simple captions; complex compositional descriptions are harder
- CLIP embeddings can be sensitive to prompt phrasing; prompt engineering helps (“a photo of a {class}”, “a {class} in the wild”, …)
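
One common way to reduce the prompt sensitivity mentioned above is to average the text embeddings of several templates per class. The sketch below assumes a text encoder is available; `toy_encode_text` is a hypothetical deterministic stand-in (not a real CLIP call) so the example runs on its own.

```python
import zlib
import numpy as np

def toy_encode_text(text, dim=8):
    """Hypothetical stand-in for a text encoder: a deterministic unit vector per string."""
    rng = np.random.default_rng(zlib.crc32(text.encode("utf-8")))
    v = rng.normal(size=dim)
    return v / np.linalg.norm(v)

def ensemble_class_embedding(class_name, templates, encode_text=toy_encode_text):
    """Average the embeddings of all prompt templates, then re-normalise."""
    embeddings = np.stack([encode_text(t.format(class_name)) for t in templates])
    mean = embeddings.mean(axis=0)
    return mean / np.linalg.norm(mean)

templates = ["a photo of a {}", "a {} in the wild"]
class_names = ["cat", "dog"]

# One ensembled text embedding per class, usable as a zero-shot classifier weight.
class_embeddings = np.stack(
    [ensemble_class_embedding(name, templates) for name in class_names]
)
```

The resulting matrix has one row per class and can replace single-prompt text embeddings in the zero-shot procedure without any other change.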
Summary
| Property | Value |
|---|---|
| Architecture | Vision encoder + Text encoder (separate) |
| Objective | Contrastive (InfoNCE), batch size N = 32,768 |
| Training data | 400M web image-text pairs (no annotation) |
| Key capability | Zero-shot classification, image-text retrieval |
| Zero-shot ImageNet | 75.3% (ViT-L/14) |
| Downstream use | Stable Diffusion, DALL-E 2, LLaVA (successors: OpenCLIP, SigLIP) |
CLIP showed that natural language supervision at web scale can yield visual representations that transfer zero-shot and are more robust than those learned from a fixed label set. It is a foundation of much of modern vision-language AI.
