Flamingo, BLIP, and the Rise of Vision-Language Models
The Vision-Language Model Problem
CLIP gives a shared embedding space for images and text. But it does not generate text — it classifies and retrieves. The next step: combine a vision encoder with a language model to produce a model that can describe, reason about, and answer questions about images.
Three landmark models — Flamingo, BLIP, and LLaVA — each solved this differently.
Flamingo (DeepMind, 2022)
Architecture
Flamingo freezes a large pre-trained language model (Chinchilla 70B) and a CLIP-like vision encoder, and bridges them with new trainable components:
- Perceiver Resampler: pools the vision encoder’s patch features (hundreds of tokens) into a fixed number of visual tokens (64). This decouples the visual sequence length from the LLM context.
- Gated Cross-Attention layers: inserted between frozen LLM layers. The text tokens attend to the visual tokens via cross-attention. A learned tanh gate controls how much visual information flows in (initialised to 0, so at the start of training the LLM behaves as if no images are present).
Frozen LLM layer N
↓
Gated Cross-Attention (text → visual tokens) ← NEW (trained)
↓
Frozen LLM layer N+1
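The two bridge components can be sketched in a few lines of NumPy. This is a minimal illustration, not Flamingo's implementation: it assumes a single attention head, a toy width of 16, random weights, and it omits the feed-forward sublayers and layer norms. All function and variable names are made up for the example. The point it shows is that the resampler always emits 64 tokens regardless of patch count, and that a gate of 0 leaves the text hidden states untouched.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # toy model width (assumption; real models are far wider)

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def cross_attention(queries, context, w_q, w_k, w_v):
    """Single-head cross-attention: each query row attends over all context rows."""
    q, k, v = queries @ w_q, context @ w_k, context @ w_v
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

def perceiver_resampler(patch_feats, latents, w_q, w_k, w_v):
    """Learned latents query the patches; output length equals the number of latents."""
    return latents + cross_attention(latents, patch_feats, w_q, w_k, w_v)

def gated_cross_attention(text_hidden, visual_tokens, gate, w_q, w_k, w_v):
    """Residual cross-attention scaled by tanh(gate)."""
    attended = cross_attention(text_hidden, visual_tokens, w_q, w_k, w_v)
    return text_hidden + np.tanh(gate) * attended

def make_weights():
    return [rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3)]

patch_feats = rng.normal(size=(400, d))  # "hundreds" of patch features
latents = rng.normal(size=(64, d))       # 64 learned latent queries
text_hidden = rng.normal(size=(10, d))   # hidden states for 10 text tokens

visual_tokens = perceiver_resampler(patch_feats, latents, *make_weights())
xattn_weights = make_weights()
out_init = gated_cross_attention(text_hidden, visual_tokens, 0.0, *xattn_weights)
out_open = gated_cross_attention(text_hidden, visual_tokens, 0.5, *xattn_weights)

print(visual_tokens.shape)                 # (64, 16)
print(np.allclose(out_init, text_hidden))  # True: gate at 0 is an identity
print(np.allclose(out_open, text_hidden))  # False: visual information now flows in
```

Because tanh(0) = 0, the frozen LLM's behaviour at the start of training is exactly its text-only behaviour, and the gate can open gradually as training proceeds.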
Key Properties
- Frozen LLM: language capabilities are preserved exactly. Visual information is injected without catastrophic forgetting.
- Interleaved image-text input: Flamingo can handle sequences like [image, text, image, text, …] naturally — each image conditions the subsequent text.
- Few-shot learning: by prepending example (image, answer) pairs in context, Flamingo achieves strong few-shot VQA — a first for large vision-language models.
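To make the interleaved few-shot format concrete, here is a small sketch of how such a prompt might be assembled. The `<image>` placeholder string, the "Question/Answer" template, and both function names are assumptions for illustration, not Flamingo's actual tokenisation. The second helper shows one way to express "each image conditions the subsequent text": every token is tagged with the index of the most recent image before it.

```python
def build_few_shot_prompt(examples, query_question):
    """Interleave support examples and a final query; '<image>' marks an image slot."""
    parts = [f"<image>Question: {q} Answer: {a}" for q, a in examples]
    parts.append(f"<image>Question: {query_question} Answer:")
    return "\n".join(parts)

def latest_image_index(tokens):
    """For each token, the index of the most recent image at or before it (-1 if none)."""
    current = -1
    result = []
    for tok in tokens:
        if tok == "<image>":
            current += 1
        result.append(current)
    return result

examples = [("What animal is this?", "A cat."), ("What colour is the car?", "Red.")]
prompt = build_few_shot_prompt(examples, "How many people are there?")
print(prompt)

tokens = ["<image>", "a", "cat", "<image>", "a", "dog"]
print(latest_image_index(tokens))  # [0, 0, 0, 1, 1, 1]
```

The model is expected to continue the text after the final "Answer:", using the earlier (image, answer) pairs as in-context demonstrations.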
Flamingo’s result
The largest model, Flamingo-80B (80B parameters), reached state-of-the-art results on VQA, COCO captioning, and other benchmarks, in most settings without any task-specific fine-tuning.
BLIP (Salesforce, 2022)
The Noisy Caption Problem
Web-scraped image-text pairs (like CLIP’s WIT) are noisy. BLIP addresses this with a bootstrapping approach:
- Train a model on noisy web data
- Filter web captions using the model (remove mismatched pairs)
- Generate synthetic captions using the model’s captioner
- Retrain on filtered + synthetic data
This self-improvement loop yields cleaner training data, which yields a better model, which yields cleaner data — bootstrapped caption quality.
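One round of the loop above might look like the following sketch. The scorer and captioner are passed in as plain callables and are toy stand-ins here (a lookup table), not BLIP's actual filter or captioner; the threshold of 0.5 is arbitrary. This version also passes the synthetic captions through the filter, which is an assumption of the sketch rather than something stated above.

```python
from typing import Callable, Dict, List, Tuple

Pair = Tuple[str, str]  # (image_id, caption)

def bootstrap_captions(
    web_pairs: List[Pair],
    match_score: Callable[[str, str], float],
    generate_caption: Callable[[str], str],
    threshold: float = 0.5,
) -> List[Pair]:
    """One round: add a synthetic caption per image, then keep pairs the filter accepts."""
    synthetic = [(image_id, generate_caption(image_id)) for image_id, _ in web_pairs]
    candidates = web_pairs + synthetic
    return [(img, cap) for img, cap in candidates if match_score(img, cap) >= threshold]

# Toy stand-ins: a "ground truth" table plays the role of the trained model.
truth: Dict[str, str] = {"img1": "a dog on a beach", "img2": "a red car"}
web_pairs = [("img1", "a dog on a beach"), ("img2", "BUY NOW 50% OFF")]

cleaned = bootstrap_captions(
    web_pairs,
    match_score=lambda img, cap: 1.0 if cap == truth[img] else 0.0,
    generate_caption=lambda img: truth[img],
)
for pair in cleaned:
    print(pair)
```

The mismatched web caption for `img2` is dropped and replaced by a synthetic one, so the retraining set contains a usable caption for every image.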
BLIP Architecture: Unified Encoder-Decoder
BLIP uses a single model with three functionalities:
| Mode | What it does |
|---|---|
| Image-Text Encoder | Contrastive alignment (like CLIP) |
| Image-grounded Text Encoder | Cross-attention fusion for understanding |
| Image-grounded Text Decoder | Autoregressive caption generation |
Most weights are shared across the three modes: the text encoder and text decoder differ mainly in their self-attention (bidirectional versus causal masking), so BLIP stays versatile without training separate models.
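The difference between the encoder and decoder modes comes down largely to the self-attention mask applied to the text tokens. A minimal sketch of the two masks (function names are illustrative):

```python
def bidirectional_mask(n):
    """Encoder modes: every text token may attend to every other token."""
    return [[True] * n for _ in range(n)]

def causal_mask(n):
    """Decoder mode: token i may attend only to tokens 0..i."""
    return [[j <= i for j in range(n)] for i in range(n)]

def render(mask):
    return "\n".join(" ".join("1" if ok else "." for ok in row) for row in mask)

print(render(bidirectional_mask(4)))
print()
print(render(causal_mask(4)))
# 1 . . .
# 1 1 . .
# 1 1 1 .
# 1 1 1 1
```

The causal mask is what makes autoregressive caption generation possible: each position predicts the next token without seeing it.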
BLIP-2: Q-Former
BLIP-2 (2023) adds a lightweight Q-Former (Querying Transformer): a small Transformer with 32 learned query tokens that extract relevant visual information from a frozen ViT into a compact representation for a frozen LLM.
Frozen ViT → Q-Former (32 learned queries ↔ visual patches) → Linear → Frozen LLM
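Only the Q-Former (together with the linear projection into the LLM) is trained. Training has two stages: (1) vision-language alignment with the frozen ViT; (2) generative language pretraining with visual conditioning through the frozen LLM.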
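The data flow through the Q-Former can be sketched as a single cross-attention step followed by a linear map. This is a simplified illustration under stated assumptions: one attention head, toy widths, random weights, and no self-attention among the queries or text branch, all of which the real Q-Former has. Only the count of 32 queries comes from the text; every other number and name is made up.

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_QUERIES = 32                  # from the text
D_VIT, D_Q, D_LLM = 64, 48, 80    # toy widths (assumptions)

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def q_former_layer(queries, patches, w_q, w_k, w_v):
    """Learned queries cross-attend to frozen ViT patch features."""
    q = queries @ w_q                      # (32, D_Q)
    k = patches @ w_k                      # (num_patches, D_Q)
    v = patches @ w_v                      # (num_patches, D_Q)
    attn = softmax(q @ k.T / np.sqrt(D_Q)) # (32, num_patches)
    return queries + attn @ v              # (32, D_Q)

queries = rng.normal(size=(NUM_QUERIES, D_Q))   # trainable
w_q = rng.normal(size=(D_Q, D_Q)) * 0.1         # trainable
w_k = rng.normal(size=(D_VIT, D_Q)) * 0.1       # trainable
w_v = rng.normal(size=(D_VIT, D_Q)) * 0.1       # trainable
w_out = rng.normal(size=(D_Q, D_LLM)) * 0.1     # trainable linear projection

patches = rng.normal(size=(257, D_VIT))         # stand-in for frozen ViT output
compact = q_former_layer(queries, patches, w_q, w_k, w_v)
llm_inputs = compact @ w_out

print(compact.shape)     # (32, 48)
print(llm_inputs.shape)  # (32, 80)
```

However many patches the ViT produces, the LLM always receives 32 visual tokens, which keeps the added context length small and fixed.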
BLIP-2 ViT-G + FlanT5-XL achieves competitive VQA performance with far fewer trainable parameters than Flamingo.
LLaVA (2023): The Minimal Recipe
Architecture: Just a Linear Projection
LLaVA (Visual Instruction Tuning, Liu et al., 2023) stripped vision-language models to their minimal effective form:
CLIP ViT-L/14 → Linear projection W → LLaMA token space
The CLIP encoder extracts visual features (256 patch tokens). A single linear layer projects them into LLaMA’s embedding space. These projected visual tokens are prepended to the text token sequence as if they were language tokens. No cross-attention, no Q-Former, no Perceiver.
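In code, the whole bridge is a matrix multiply and a concatenation. The sketch below uses toy widths and random values in place of real CLIP features and LLaMA embeddings; only the 256 patch tokens come from the text, and all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_PATCHES = 256          # from the text
D_VISION, D_LLM = 32, 48   # toy widths (assumptions; real widths are much larger)

visual_feats = rng.normal(size=(NUM_PATCHES, D_VISION))  # stand-in for CLIP patch features
W = rng.normal(size=(D_VISION, D_LLM)) * 0.02            # the trainable projection
text_embeds = rng.normal(size=(12, D_LLM))               # stand-in for 12 text token embeddings

visual_tokens = visual_feats @ W                          # (256, D_LLM)
llm_input = np.concatenate([visual_tokens, text_embeds], axis=0)

print(visual_tokens.shape)  # (256, 48)
print(llm_input.shape)      # (268, 48)
```

From the LLM's point of view the result is simply a longer input sequence, which is why no architectural change to the language model is needed.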
Instruction Tuning Data
LLaVA generates visual instruction tuning data with a language-only GPT-4: given an image’s existing captions and bounding box annotations as text (GPT-4 never sees the pixels), it writes conversations, detailed descriptions, and complex reasoning questions about the image. This synthetic data teaches the model to follow visual instructions.
Starting from a 13B LLaMA-based LLM (Vicuna) and fine-tuned on this data, LLaVA reports an 85.1% relative score compared with GPT-4 on a synthetic multimodal instruction-following benchmark, using a far smaller model than Flamingo-80B. The paper also reports results after fine-tuning on ScienceQA.
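The key trick is that the image is represented purely as text. A hedged sketch of how such a request might be assembled follows; the prompt wording, the task names, and the function names are all hypothetical, and no API call is made.

```python
def format_box(label, box):
    """Render one bounding box as text, coordinates normalised to [0, 1]."""
    coords = ", ".join(f"{v:.2f}" for v in box)
    return f"- {label}: {coords}"

def build_context(captions, boxes):
    """Describe an image in plain text for a language-only model."""
    lines = ["Captions:"]
    lines += [f"- {c}" for c in captions]
    lines.append("Objects (label: x1, y1, x2, y2):")
    lines += [format_box(label, box) for label, box in boxes]
    return "\n".join(lines)

# Hypothetical instructions, one per data type named in the text.
INSTRUCTIONS = {
    "conversation": "Write a multi-turn Q&A between a user and an assistant about this image.",
    "detailed_description": "Write a detailed description of this image.",
    "complex_reasoning": "Write a question about this image that requires reasoning, then answer it.",
}

def build_request(captions, boxes, task):
    return f"{build_context(captions, boxes)}\n\n{INSTRUCTIONS[task]}"

request = build_request(
    captions=["a dog runs along a beach"],
    boxes=[("dog", (0.10, 0.20, 0.50, 0.90))],
    task="conversation",
)
print(request)
```

The text returned by the language model is then paired with the original image to form one instruction-tuning example.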
LLaVA-NeXT and Successors
LLaVA-1.5 replaces the linear projection with a 2-layer MLP, a small change that improves results. LLaVA-NeXT adds dynamic high-resolution processing. Many open-source VLMs (InternVL, Idefics, Qwen-VL, …) share the same overall shape, although some connect the two parts with cross-attention or a resampler rather than a plain projection:
Frozen vision encoder → projection → LLM (partially or fully fine-tuned)
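For a sense of scale, the sketch below compares a linear projector with a 2-layer MLP projector. It assumes a 1024-wide vision feature, a 5120-wide LLM embedding, bias terms on every layer, and a Linear, GELU, Linear layout for the MLP; these are assumptions for the arithmetic, not figures taken from this article. The forward pass uses toy widths and random weights.

```python
import numpy as np

def gelu(x):
    """Tanh approximation of GELU."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def mlp_projector(feats, w1, b1, w2, b2):
    """Two-layer MLP: Linear -> GELU -> Linear."""
    return gelu(feats @ w1 + b1) @ w2 + b2

def linear_params(d_in, d_out):
    """Weights plus biases of one dense layer."""
    return d_in * d_out + d_out

D_VISION, D_LLM = 1024, 5120  # assumed widths
linear_count = linear_params(D_VISION, D_LLM)
mlp_count = linear_params(D_VISION, D_LLM) + linear_params(D_LLM, D_LLM)
print(f"linear projector: {linear_count:,}")  # linear projector: 5,248,000
print(f"2-layer MLP:      {mlp_count:,}")     # 2-layer MLP:      31,467,520

# Toy forward pass to show the shapes.
rng = np.random.default_rng(0)
d_v, d_l = 8, 12
feats = rng.normal(size=(4, d_v))
out = mlp_projector(
    feats,
    rng.normal(size=(d_v, d_l)), np.zeros(d_l),
    rng.normal(size=(d_l, d_l)), np.zeros(d_l),
)
print(out.shape)  # (4, 12)
```

Either way the projector is tiny next to a multi-billion-parameter LLM, which is the sense in which the bridge is "minimal".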
Comparison
| Model | Vision encoder | Connection | LLM | Trainable params |
|---|---|---|---|---|
| Flamingo 80B | NFNet | Gated cross-attn + Perceiver | Chinchilla 70B | ~10B |
| BLIP-2 | ViT-G | Q-Former | FlanT5-XL | ~188M |
| LLaVA-13B | CLIP ViT-L | Linear | LLaMA-13B (Vicuna) | ~5M projection in stage 1; LLM also fine-tuned in stage 2 |
LLaVA’s key insight: the vision encoder (CLIP) and LLM are already powerful enough — the bridge can be minimal.
The Modern VLM Recipe
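Current state-of-the-art VLMs build on these designs. Open models such as Qwen-VL document their architecture; the internals of proprietary systems (GPT-4V, Gemini, Claude’s vision) are not public, but the commonly described recipe is: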
- Vision encoder: high-resolution CLIP or SigLIP ViT
- Projection: MLP or cross-attention (Q-Former style)
- LLM: large, instruction-tuned language model
- Training: multi-stage (alignment then instruction tuning)
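The multi-stage schedule can be written down as a small configuration that says which modules receive gradients in each stage. The sketch below is hypothetical and modelled on a LLaVA-style schedule (projection only, then projection plus LLM); the module names, stage names, and the choice to keep the vision encoder frozen throughout are assumptions, and other models make different choices.

```python
from dataclasses import dataclass

MODULES = ("vision_encoder", "projection", "llm")

@dataclass(frozen=True)
class Stage:
    name: str
    trainable: frozenset  # names of modules that are updated in this stage

RECIPE = [
    Stage("alignment", frozenset({"projection"})),
    Stage("instruction_tuning", frozenset({"projection", "llm"})),
]

def requires_grad(stage):
    """Map every module to whether it is updated during the given stage."""
    return {module: module in stage.trainable for module in MODULES}

for stage in RECIPE:
    print(stage.name, requires_grad(stage))
```

In a real training script, a mapping like this would typically drive which parameters are passed to the optimiser in each stage.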
Summary
| Model | Key Innovation |
|---|---|
| Flamingo | Frozen LLM + gated cross-attention; few-shot visual reasoning |
| BLIP | Bootstrapped caption quality; unified encoder-decoder |
| BLIP-2 | Q-Former for parameter-efficient alignment |
| LLaVA | Linear projection suffices; visual instruction tuning |
The evolution shows a clear trend: simpler architectures and higher-quality training data. The modern VLM is a CLIP-style encoder, a small projection, and an LLM; its capability comes from scale and data rather than from elaborate fusion mechanisms.
