Flamingo, BLIP, and the Rise of Vision-Language Models

TL;DR: Flamingo (DeepMind, 2022) froze a powerful LLM and added gated cross-attention to visual features, enabling few-shot VQA at scale. BLIP (Salesforce, 2022) bootstrapped cleaner captions with a filter-generator loop. LLaVA (2023) showed that a linear projection from CLIP ViT into LLaMA is sufficient to connect vision and language, with a bridge orders of magnitude smaller than Flamingo's. Many modern VLMs follow LLaVA's recipe.

The Vision-Language Model Problem

CLIP gives a shared embedding space for images and text. But it does not generate text — it classifies and retrieves. The next step: combine a vision encoder with a language model to produce a model that can describe, reason about, and answer questions about images.

Three landmark models — Flamingo, BLIP, and LLaVA — each solved this differently.

Flamingo (DeepMind, 2022)

Architecture

Flamingo freezes a large pre-trained language model (Chinchilla 70B) and a contrastively pre-trained, CLIP-like vision encoder (an NFNet), and bridges them with new trainable components:

  1. Perceiver Resampler: pools the vision encoder’s patch features (hundreds of tokens) into a fixed number of visual tokens (64). This decouples the visual sequence length from the LLM context.

  2. Gated Cross-Attention layers: inserted between frozen LLM layers. The text tokens attend to the visual tokens via cross-attention. A learned tanh gate controls how much visual information flows in (initialised to 0 — at start, the LLM behaves as if no images are present).

Frozen LLM layer N
       ↓
Gated Cross-Attention (text → visual tokens)   ← NEW (trained)
       ↓
Frozen LLM layer N+1
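
To make the two components concrete, here is a minimal NumPy sketch, not Flamingo's actual implementation. It assumes a single attention head, one resampler layer, no feed-forward blocks, and a toy width `d`; the helper names (`cross_attention`, `gated_cross_attention`, `rand_w`) are made up for illustration. It shows the resampler producing 64 visual tokens and the tanh gate leaving the text states untouched at initialisation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values, w_q, w_k, w_v):
    """Single-head cross-attention: each query row attends over keys_values."""
    q = queries @ w_q
    k = keys_values @ w_k
    v = keys_values @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

rng = np.random.default_rng(0)
d = 16                                   # toy width (assumption, not Flamingo's)
n_patches, n_latents, n_text = 200, 64, 5

def rand_w():
    return rng.normal(scale=d ** -0.5, size=(d, d))

# Perceiver Resampler, simplified to one layer:
# 64 latent vectors (learned in the real model) query the patch features.
patch_features = rng.normal(size=(n_patches, d))
latents = rng.normal(size=(n_latents, d))
visual_tokens = latents + cross_attention(
    latents, patch_features, rand_w(), rand_w(), rand_w()
)

# Gated cross-attention: text hidden states attend to the visual tokens,
# and tanh(alpha) scales how much of the result is added back.
text_hidden = rng.normal(size=(n_text, d))
w_q, w_k, w_v = rand_w(), rand_w(), rand_w()

def gated_cross_attention(text_hidden, visual_tokens, alpha):
    attended = cross_attention(text_hidden, visual_tokens, w_q, w_k, w_v)
    return text_hidden + np.tanh(alpha) * attended

out_init = gated_cross_attention(text_hidden, visual_tokens, alpha=0.0)
out_open = gated_cross_attention(text_hidden, visual_tokens, alpha=1.0)

print(visual_tokens.shape)                 # 64 visual tokens, whatever n_patches is
print(np.allclose(out_init, text_hidden))  # gate closed: text states unchanged
print(np.allclose(out_open, text_hidden))  # gate open: visual information mixed in
```

With `alpha=0.0` the layer is an identity on the text stream, which is why training can start from the frozen LLM's behaviour and open the gate gradually.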

Key Properties

  • Frozen LLM: language capabilities are preserved exactly. Visual information is injected without catastrophic forgetting.
  • Interleaved image-text input: Flamingo can handle sequences like [image, text, image, text, …] naturally — each image conditions the subsequent text.
  • Few-shot learning: by prepending example (image, answer) pairs in context, Flamingo achieves strong few-shot VQA — a first for large vision-language models.
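
As a rough illustration of interleaved few-shot prompting, the sketch below builds a prompt from a handful of (image, answer) examples followed by a query image. The `<image>` placeholder, the "Question/Answer" wording, and the function name are assumptions made for this example rather than Flamingo's exact format.

```python
IMAGE_TOKEN = "<image>"  # hypothetical placeholder marking where an image is inserted

def build_few_shot_prompt(examples, query_image, question):
    """Interleave (image, answer) examples, then leave the query answer open.

    Returns the ordered list of images and the text with one placeholder per image.
    """
    images, chunks = [], []
    for image, answer in examples:
        images.append(image)
        chunks.append(f"{IMAGE_TOKEN} Question: {question} Answer: {answer}")
    images.append(query_image)
    chunks.append(f"{IMAGE_TOKEN} Question: {question} Answer:")
    return images, "\n".join(chunks)

images, prompt = build_few_shot_prompt(
    examples=[("cat.jpg", "a cat"), ("dog.jpg", "a dog")],
    query_image="bird.jpg",
    question="What animal is this?",
)
print(images)
print(prompt)
```

The model is then asked to continue the text after the final "Answer:", conditioning on all three images in order.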

Flamingo’s result

Flamingo-80B (80B parameters) set a new state of the art on VQA, COCO captioning, and other benchmarks, without any task-specific fine-tuning in most settings.

BLIP (Salesforce, 2022)

The Noisy Caption Problem

Web-scraped image-text pairs (like CLIP’s WIT) are noisy. BLIP addresses this with a bootstrapping approach:

  1. Train a model on noisy web data
  2. Filter web captions using the model (remove mismatched pairs)
  3. Generate synthetic captions using the model’s captioner
  4. Retrain on filtered + synthetic data

This self-improvement loop yields cleaner training data, which yields a better model, which yields cleaner data — bootstrapped caption quality.
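
One round of the loop can be sketched as below. This is a toy under stated assumptions: `match_score` and `generate_caption` are hypothetical stand-ins for the trained filter and captioner, images are represented as sets of tags, and synthetic captions are passed through the same filter as web captions.

```python
def bootstrap_captions(pairs, match_score, generate_caption, threshold=0.5):
    """One filter-then-generate round over (image, web_caption) pairs."""
    cleaned = []
    for image, web_caption in pairs:
        # Step 2: keep the web caption only if the filter thinks it matches.
        if match_score(image, web_caption) >= threshold:
            cleaned.append((image, web_caption))
        # Step 3: add a synthetic caption, also subject to the filter.
        synthetic = generate_caption(image)
        if match_score(image, synthetic) >= threshold:
            cleaned.append((image, synthetic))
    return cleaned  # Step 4 would retrain on this set

# Toy stand-ins: an "image" is a set of tags, a caption is a string.
def toy_match_score(image_tags, caption):
    words = set(caption.lower().split())
    return len(words & image_tags) / max(len(words), 1)

def toy_captioner(image_tags):
    return " ".join(sorted(image_tags))

pairs = [
    (frozenset({"dog", "park"}), "dog playing park"),
    (frozenset({"cat", "sofa"}), "buy cheap shoes online"),
]
cleaned = bootstrap_captions(pairs, toy_match_score, toy_captioner)
for _, caption in cleaned:
    print(caption)
```

The mismatched web caption for the second image is dropped and replaced by a synthetic one, while the first image keeps both its web caption and a synthetic caption.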

BLIP Architecture: Unified Encoder-Decoder

BLIP uses a single model with three functionalities:

| Mode | What it does |
| --- | --- |
| Image-Text Encoder | Contrastive alignment (like CLIP) |
| Image-grounded Text Encoder | Cross-attention fusion for understanding |
| Image-grounded Text Decoder | Autoregressive caption generation |

Most weights are shared across the three modes: the text encoder and decoder differ mainly in their self-attention layers (bidirectional for the encoders, causal for the decoder), which makes BLIP versatile without training separate models.
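
The difference between the understanding and generation modes comes down to which text positions each token may attend to. A small sketch of the two mask shapes, in plain Python; the mode names mirror the ones above, and the helper names are made up for this example:

```python
def bidirectional_mask(n):
    """Every token may attend to every other token (encoder modes)."""
    return [[1] * n for _ in range(n)]

def causal_mask(n):
    """Token i may attend only to tokens 0..i (decoder mode)."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

MODE_TO_MASK = {
    "image-text encoder": bidirectional_mask,
    "image-grounded text encoder": bidirectional_mask,
    "image-grounded text decoder": causal_mask,
}

for mode, make_mask in MODE_TO_MASK.items():
    print(mode)
    for row in make_mask(4):
        print("  ", row)
```

The causal mask is what lets the decoder be trained with next-token prediction for caption generation.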

BLIP-2: Q-Former

BLIP-2 (2023) adds a lightweight Q-Former (Querying Transformer): a small Transformer with 32 learned query tokens that extract relevant visual information from a frozen ViT into a compact representation for a frozen LLM.

Frozen ViT → Q-Former (32 learned queries ↔ visual patches) → Linear → Frozen LLM

Only the Q-Former (plus the linear layer that feeds the LLM) is trained, in two stages: (1) vision-language alignment against the frozen ViT; (2) generative language pretraining with visual conditioning through the frozen LLM.
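
A useful property of this design is that the LLM always receives 32 visual tokens, no matter how many patches the ViT emits. The sketch below illustrates that with a heavily simplified stand-in: a single cross-attention step and toy dimensions, whereas the real Q-Former is a multi-layer Transformer in which queries also self-attend. `ToyQFormer` and all sizes are assumptions for illustration.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

class ToyQFormer:
    """Learned queries cross-attend to patch features, then map to the LLM width."""

    def __init__(self, d_vit, d_q, d_llm, n_queries=32, seed=0):
        rng = np.random.default_rng(seed)
        self.queries = rng.normal(size=(n_queries, d_q))   # learned in practice
        self.w_k = rng.normal(scale=d_vit ** -0.5, size=(d_vit, d_q))
        self.w_v = rng.normal(scale=d_vit ** -0.5, size=(d_vit, d_q))
        self.w_out = rng.normal(scale=d_q ** -0.5, size=(d_q, d_llm))  # "Linear" step

    def __call__(self, patches):
        k = patches @ self.w_k
        v = patches @ self.w_v
        attn = softmax(self.queries @ k.T / np.sqrt(k.shape[-1]))
        return (self.queries + attn @ v) @ self.w_out

qformer = ToyQFormer(d_vit=48, d_q=24, d_llm=64)
rng = np.random.default_rng(1)
low_res = qformer(rng.normal(size=(100, 48)))    # 100 patches in
high_res = qformer(rng.normal(size=(400, 48)))   # 400 patches in
print(low_res.shape, high_res.shape)             # same output shape for both
```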

BLIP-2 ViT-G + FlanT5-XL achieves competitive VQA performance with far fewer trainable parameters than Flamingo.

LLaVA (2023): The Minimal Recipe

Architecture: Just a Linear Projection

LLaVA (Visual Instruction Tuning, Liu et al., 2023) stripped vision-language models to their minimal effective form:

CLIP ViT-L/14 → Linear projection W → LLaMA token space

The CLIP encoder extracts visual features (256 patch tokens). A single linear layer projects them into LLaMA’s embedding space. These projected visual tokens are prepended to the text token sequence as if they were language tokens. No cross-attention, no Q-Former, no Perceiver.

Why does a linear projection suffice? CLIP features are already semantically rich, so the linear layer only needs to change the coordinate system: from CLIP's feature space (1024-dimensional for ViT-L/14) to the LLM's embedding space (5120-dimensional for LLaMA-13B). The LLM then processes everything jointly via its own self-attention layers.
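
In terms of shapes, the whole bridge is one matrix multiply followed by a concatenation. The NumPy sketch below uses random arrays in place of real CLIP features and text embeddings; the widths 1024 (CLIP ViT-L/14) and 5120 (LLaMA-13B), the bias term, and the 12-token prompt length are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_patches, d_clip, d_llm, n_text = 256, 1024, 5120, 12

# Stand-ins for the frozen CLIP patch features and the LLM's text embeddings.
patch_features = rng.normal(size=(n_patches, d_clip)).astype(np.float32)
text_embeddings = rng.normal(size=(n_text, d_llm)).astype(np.float32)

# The projection W (and bias b) is the only new component.
W = rng.normal(scale=0.02, size=(d_clip, d_llm)).astype(np.float32)
b = np.zeros(d_llm, dtype=np.float32)

visual_tokens = patch_features @ W + b                      # (256, 5120)
llm_input = np.concatenate([visual_tokens, text_embeddings], axis=0)

print(visual_tokens.shape)   # visual tokens now live in the LLM's embedding space
print(llm_input.shape)       # 256 visual tokens followed by 12 text tokens
```

From the LLM's point of view, `llm_input` is just a longer sequence of embeddings, so no architectural change to the language model is needed.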

Instruction Tuning Data

LLaVA generates visual instruction tuning data using language-only GPT-4: given an image's human-written captions and bounding box annotations as text, GPT-4 generates conversations, detailed descriptions, and complex reasoning tasks about the image. This synthetic data teaches the model to follow visual instructions.
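
The key trick is that the image is described to GPT-4 purely as text. A hedged sketch of what such a request might look like follows; the layout, wording, and function names are invented for illustration and are not the prompts used by the LLaVA authors.

```python
def render_image_context(captions, boxes):
    """Describe an image as text: captions plus labelled, normalised boxes."""
    lines = ["Captions:"]
    lines += [f"- {caption}" for caption in captions]
    lines.append("Objects (normalised x1, y1, x2, y2):")
    lines += [
        f"- {label}: [{x1:.2f}, {y1:.2f}, {x2:.2f}, {y2:.2f}]"
        for label, (x1, y1, x2, y2) in boxes
    ]
    return "\n".join(lines)

def build_request(captions, boxes, task):
    """Combine the textual image context with the kind of data to generate."""
    return f"{render_image_context(captions, boxes)}\n\nTask: {task}"

context = render_image_context(
    captions=["A person walking a dog in a park."],
    boxes=[("person", (0.1, 0.2, 0.5, 0.9)), ("dog", (0.55, 0.6, 0.8, 0.95))],
)
request = build_request(
    captions=["A person walking a dog in a park."],
    boxes=[("person", (0.1, 0.2, 0.5, 0.9)), ("dog", (0.55, 0.6, 0.8, 0.95))],
    task="Write a multi-turn conversation about this image.",
)
print(request)
```

Swapping the `task` string is how one would ask for the three data types mentioned above: conversations, detailed descriptions, and complex reasoning.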

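Fine-tuned on LLaMA-13B with this data (plus some open VQA datasets), LLaVA reports an 85.1% relative score compared with GPT-4 on a synthetic multimodal instruction-following benchmark, using a trainable bridge that is a tiny fraction of the size of Flamingo's.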

LLaVA-Next and Successors

LLaVA-1.5 replaces the linear projection with a 2-layer MLP, a small improvement. LLaVA-Next adds dynamic high-resolution processing. Many open-source VLMs (InternVL, Idefics, Qwen-VL, …) use variations on the same overall pattern:

Frozen vision encoder → projection → LLM (partially or fully fine-tuned)
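
To get a feel for how small these projections are, the arithmetic below counts parameters for a linear bridge and a 2-layer MLP bridge. It assumes a 1024-wide vision encoder, a 5120-wide LLM, bias terms on every layer, and an MLP hidden width equal to the LLM width; treat all of these as illustrative assumptions.

```python
def linear_params(d_in, d_out, bias=True):
    """Parameter count of one dense layer."""
    return d_in * d_out + (d_out if bias else 0)

d_vision, d_llm = 1024, 5120

linear = linear_params(d_vision, d_llm)
mlp = linear_params(d_vision, d_llm) + linear_params(d_llm, d_llm)

print(f"linear projection: {linear:,} parameters")
print(f"2-layer MLP:       {mlp:,} parameters")
print(f"share of a 13B LLM: {mlp / 13e9:.4%}")
```

Even the MLP variant stays in the tens of millions of parameters, a fraction of a percent of the language model it feeds.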

Comparison

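| Model | Vision encoder | Connection | LLM | Trainable params |
| --- | --- | --- | --- | --- |
| Flamingo 80B | NFNet | Gated cross-attn + Perceiver | Chinchilla 70B | ~10B |
| BLIP-2 | ViT-G | Q-Former | FlanT5-XL | ~188M |
| LLaVA-13B | CLIP ViT-L | Linear | LLaMA-13B | ~5M (projection only; the LLM is also fine-tuned during instruction tuning) |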

LLaVA’s key insight: the vision encoder (CLIP) and LLM are already powerful enough — the bridge can be minimal.

The Modern VLM Recipe

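Openly documented VLMs such as Qwen-VL follow evolved versions of these designs, and proprietary systems (GPT-4V, Gemini, Claude's vision) are generally assumed to be similar, although their architectures are not public: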

  1. Vision encoder: high-resolution CLIP or SigLIP ViT
  2. Projection: MLP or cross-attention (Q-Former style)
  3. LLM: large, instruction-tuned language model
  4. Training: multi-stage (alignment then instruction tuning)
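
The multi-stage schedule in step 4 mostly amounts to deciding which modules receive gradients at each stage. The sketch below encodes a LLaVA-style two-stage schedule as data; the stage names, module names, and the choice to keep the vision encoder frozen throughout are assumptions for illustration, and real recipes differ between models.

```python
from dataclasses import dataclass

MODULES = ("vision_encoder", "projector", "llm")

@dataclass(frozen=True)
class Stage:
    name: str
    trainable: frozenset  # module names that receive gradient updates

RECIPE = (
    Stage("alignment", frozenset({"projector"})),
    Stage("instruction_tuning", frozenset({"projector", "llm"})),
)

def frozen_modules(stage):
    """Modules whose weights stay fixed during this stage."""
    return [m for m in MODULES if m not in stage.trainable]

for stage in RECIPE:
    print(stage.name, "| train:", sorted(stage.trainable),
          "| frozen:", frozen_modules(stage))
```

In a real training script, this table would drive which parameters have gradients enabled before each stage starts.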

Summary

| Model | Key Innovation |
| --- | --- |
| Flamingo | Frozen LLM + gated cross-attention; few-shot visual reasoning |
| BLIP | Bootstrapped caption quality; unified encoder-decoder |
| BLIP-2 | Q-Former for parameter-efficient alignment |
| LLaVA | Linear projection suffices; visual instruction tuning |

The evolution shows a clear trend: less architectural complexity, more training data quality. The modern VLM is a CLIP encoder, a small projection, and an LLM — the intelligence comes from scale and data, not from elaborate fusion mechanisms.