Query, Key, Value: The Intuition Behind QKV

4 minute read

Published: March 02, 2024

TL;DR: Q (query) is what you're looking for. K (key) is what each token advertises about itself. V (value) is the information that gets retrieved when a match is found. Together they implement a soft, differentiable information lookup.

The Analogy: A Smart Library

Imagine walking into a library with a question: “I want something about neural networks.”

Query (Q): your question — what you’re searching for
Key (K): the labels on every book’s spine — what each book is about
Value (V): the actual content of each book — what you retrieve when you pick one up

You compare your question (Q) against every spine label (K). The closer the match, the more of that book’s content (V) you retrieve. If three books are slightly relevant and one is very relevant, you blend them proportionally.

This is exactly what attention does — but over tokens in a sequence, and with vectors instead of text labels.

In Transformer Notation

Each token in the input sequence gets three vector representations learned by the model:

Q = X · Wᵩ K = X · W_K V = X · W_V

Where X is the token representation and W_Q, W_K, W_V are learned weight matrices. The model learns what to advertise (K), what to ask for (Q), and what to share (V) — and these can be different projections of the same token.

A Token’s Three Faces

Consider the word “bank” in the sentence “The bank approved the loan.”

When “bank” is being asked about (as a key):

It advertises: I’m a financial institution

When “bank” is asking questions (as a query):

It might ask: What other financial terms are nearby?

When “bank” contributes information (as a value):

It provides: its full contextual representation, to be mixed into other tokens’ outputs

A single token plays all three roles simultaneously — as a key for others querying it, as a query seeking information from others, and as a value supplying its content when called.

Why Not Just Use One Matrix?

A natural question: why not compute similarity directly between token representations, without Q, K, V projections?

Two reasons:

1. Asymmetry. The question you ask (Q) and the label you advertise (K) can be different things. The word “bank” might advertise its financial meaning but query for loan-related terms. A single representation forces them to be the same — which is too restrictive.

2. Information compression. The value (V) can be a different, richer projection than the key (K). Keys are optimised for matching; values are optimised for being informative. Separating them lets the model decouple finding information from extracting it.

Key insight: Q and K are both in the same "matching space" (so their dot product is meaningful). V lives in a different "content space" (what actually gets mixed into the output). These are distinct roles, and the model learns each separately.

The Full Attention Computation Step by Step

Given a single query token and a sequence of key-value pairs:

Match: compute q · kᵢ for every token i → raw similarity scores
Scale: divide by √d_k → prevent softmax saturation
Normalise: softmax → convert scores to a probability distribution (attention weights)
Retrieve: weighted sum of values → the output for this query token

output = Σᵢ softmax( q · kᵢ / √d_k ) · vᵢ

The result is a blend of all value vectors, weighted by how much each token’s key matched the query. Tokens with high relevance contribute more; irrelevant tokens contribute near zero.

The Analogy Revisited: A Database

Database concept	Attention equivalent
Search query	Query vector q
Index keys	Key vectors k₁…kₙ
Retrieved records	Value vectors v₁…vₙ
Exact match (hard)	Argmax over scores
Fuzzy match (soft)	Softmax-weighted blend

Classic databases return one result (hard lookup). Attention returns a differentiable weighted blend — which means gradients can flow through it and the whole system can be trained end-to-end.

Summary

Symbol	What it is	What it does
Q	Query	What this token is looking for
K	Key	What each token advertises it contains
V	Value	What each token contributes when selected
QKᵀ	Similarity	How well query matches each key
Softmax	Normalisation	Converts similarities to weights
Weighted V	Output	Blend of values, weighted by attention

QKV is a learned, differentiable, soft database lookup. Once you see it this way, the rest of the Transformer follows naturally.

Share on

Bluesky Facebook LinkedIn X (formerly Twitter)

Alessio Borgi

Query, Key, Value: The Intuition Behind QKV

The Analogy: A Smart Library

In Transformer Notation

A Token’s Three Faces

Why Not Just Use One Matrix?

The Full Attention Computation Step by Step

The Analogy Revisited: A Database

Summary

Share on

You May Also Enjoy

Flamingo, BLIP, and the Rise of Vision-Language Models

CLIP: Connecting Images and Text with Contrastive Learning

MAE: Masked Autoencoders Are Scalable Vision Learners

DeiT: Training ViTs Efficiently Without Large Datasets