Query, Key, Value: The Intuition Behind QKV

TL;DR: Q (query) is what you're looking for. K (key) is what each token advertises about itself. V (value) is the information that gets retrieved when a match is found. Together they implement a soft, differentiable information lookup.

The Analogy: A Smart Library

Imagine walking into a library with a question: “I want something about neural networks.”

  • Query (Q): your question — what you’re searching for
  • Key (K): the labels on every book’s spine — what each book is about
  • Value (V): the actual content of each book — what you retrieve when you pick one up

You compare your question (Q) against every spine label (K). The closer the match, the more of that book’s content (V) you retrieve. If three books are slightly relevant and one is very relevant, you blend them proportionally.

This is exactly what attention does — but over tokens in a sequence, and with vectors instead of text labels.

In Transformer Notation

Each token in the input sequence is mapped to three vectors, each computed with a projection the model learns:

Q = X · W_Q     K = X · W_K     V = X · W_V

Here X is the matrix of token representations, one row per token, and W_Q, W_K, W_V are learned weight matrices. The model learns what to advertise (K), what to ask for (Q), and what to share (V), and these can be different projections of the same token.
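To make the projections concrete, here is a minimal sketch in plain Python. The sizes (3 tokens, width 4, a 2-dimensional matching space, a 3-dimensional content space) and every number in it are made up for illustration; in a real model the weight matrices are learned rather than written by hand.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

# X: one row per token (3 tokens, 4 features each). Values are arbitrary.
X = [
    [1.0, 0.0, 1.0, 0.0],
    [0.0, 2.0, 0.0, 1.0],
    [1.0, 1.0, 0.0, 0.0],
]

# Hand-written stand-ins for the learned weight matrices (assumption).
W_Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]  # 4 x 2
W_K = [[0.0, 1.0], [1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]  # 4 x 2
W_V = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 1.0, 1.0]]  # 4 x 3

Q = matmul(X, W_Q)  # what each token asks for
K = matmul(X, W_K)  # what each token advertises
V = matmul(X, W_V)  # what each token shares

print(Q)  # [[2.0, 0.0], [0.0, 3.0], [1.0, 1.0]]
print(K)  # [[0.0, 2.0], [3.0, 0.0], [1.0, 1.0]]
print(V)  # [[1.0, 0.0, 1.0], [1.0, 3.0, 1.0], [1.0, 1.0, 0.0]]
```

Note that the same row of X produces three different vectors: the first token's query is [2, 0] while its key is [0, 2].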

A Token’s Three Faces

Consider the word “bank” in the sentence “The bank approved the loan.”

When “bank” is being asked about (as a key):

  • It advertises: I’m a financial institution

When “bank” is asking questions (as a query):

  • It might ask: What other financial terms are nearby?

When “bank” contributes information (as a value):

  • It provides: a learned projection of its contextual representation, to be mixed into other tokens’ outputs

A single token plays all three roles simultaneously — as a key for others querying it, as a query seeking information from others, and as a value supplying its content when called.

Why Not Just Use One Matrix?

A natural question: why not compute similarity directly between token representations, without Q, K, V projections?

Two reasons:

1. Asymmetry. The question you ask (Q) and the label you advertise (K) can be different things. The word “bank” might advertise its financial meaning but query for loan-related terms. A single representation forces them to be the same — which is too restrictive.

2. Information compression. The value (V) can be a different, richer projection than the key (K). Keys are optimised for matching; values are optimised for being informative. Separating them lets the model decouple finding information from extracting it.
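The asymmetry point can be checked numerically. The sketch below uses made-up 2-dimensional representations and hand-picked projection matrices (both are assumptions for illustration). Scores computed directly from token representations are always symmetric, while scores computed from separate Q and K projections generally are not.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(row) for row in zip(*m)]

# Two tokens with arbitrary 2-d representations.
X = [[1.0, 2.0], [3.0, 1.0]]

# Hypothetical projection matrices; a real model would learn these.
W_Q = [[1.0, 2.0], [0.0, 1.0]]
W_K = [[1.0, 0.0], [3.0, 1.0]]

# No projections: score(i, j) = x_i . x_j, so score(i, j) == score(j, i).
raw_scores = matmul(X, transpose(X))

# With projections: score(i, j) = q_i . k_j, which need not equal q_j . k_i.
Q = matmul(X, W_Q)
K = matmul(X, W_K)
projected_scores = matmul(Q, transpose(K))

print(raw_scores)        # [[5.0, 5.0], [5.0, 10.0]]
print(projected_scores)  # [[15.0, 10.0], [35.0, 25.0]]
```

In the projected version, token 0 scores token 1 at 10 while token 1 scores token 0 at 35: how strongly A attends to B is no longer tied to how strongly B attends to A.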

Key insight: Q and K are both in the same "matching space" (so their dot product is meaningful). V lives in a different "content space" (what actually gets mixed into the output). These are distinct roles, and the model learns each separately.

The Full Attention Computation Step by Step

Given a single query token and a sequence of key-value pairs:

  1. Match: compute q · kᵢ for every token i → raw similarity scores
  2. Scale: divide by √d_k → prevent softmax saturation
  3. Normalise: softmax → convert scores to a probability distribution (attention weights)
  4. Retrieve: weighted sum of values → the output for this query token

output = Σᵢ αᵢ · vᵢ,   where αᵢ = exp( q · kᵢ / √d_k ) / Σⱼ exp( q · kⱼ / √d_k )
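The four steps can be sketched for a single query in plain Python. The function name `attend` and the toy vectors are illustrative choices, not part of any library; the max-subtraction inside `softmax` is a common numerical-stability trick that does not change the result.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    m = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(q, keys, values):
    """Single-query attention: returns (attention weights, output vector)."""
    d_k = len(q)
    # 1. Match: dot product of the query with every key.
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    # 2. Scale: divide by sqrt(d_k).
    scaled = [s / math.sqrt(d_k) for s in scores]
    # 3. Normalise: softmax over the scaled scores.
    weights = softmax(scaled)
    # 4. Retrieve: weighted sum of the value vectors.
    d_v = len(values[0])
    output = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(d_v)]
    return weights, output

# Toy data: the first key points the same way as the query,
# the second is orthogonal to it, the third points the opposite way.
q = [1.0, 0.0]
keys = [[4.0, 0.0], [0.0, 4.0], [-4.0, 0.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]

weights, output = attend(q, keys, values)
print(weights)  # largest weight on the first key, smallest on the third
print(output)   # dominated by the first value vector
```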

The result is a blend of all value vectors, weighted by how much each token’s key matched the query. Tokens with high relevance contribute more; irrelevant tokens contribute near zero.
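The scaling step is easiest to appreciate with numbers. In this sketch (d_k = 64 and the key vectors are arbitrary choices for illustration), three keys differ only modestly in how well they match the query. Without the √d_k division, the softmax still puts almost all of its weight on the best match, leaving the other tokens with almost no contribution.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

d_k = 64
q = [1.0] * d_k
# Three keys that match the query fully, 90%, and 80% as strongly.
keys = [[1.0] * d_k, [0.9] * d_k, [0.8] * d_k]

raw_scores = [sum(a * b for a, b in zip(q, k)) for k in keys]  # about 64, 57.6, 51.2

unscaled = softmax(raw_scores)
scaled = softmax([s / math.sqrt(d_k) for s in raw_scores])

print(unscaled)  # nearly one-hot: the top key takes more than 99% of the weight
print(scaled)    # a genuine blend: all three keys keep a visible share
```

Dot products grow with the dimension of the vectors, so dividing by √d_k keeps the scores in a range where the softmax stays responsive to all of its inputs.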

The Analogy Revisited: A Database

Database concept     | Attention equivalent
Search query         | Query vector q
Index keys           | Key vectors k₁…kₙ
Retrieved records    | Value vectors v₁…vₙ
Exact match (hard)   | Argmax over scores
Fuzzy match (soft)   | Softmax-weighted blend

A classic exact-match lookup is hard: it returns the record whose key matches and nothing else. Attention returns a differentiable weighted blend, which means gradients can flow through it and the whole system can be trained end-to-end.
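The difference between the hard and soft rows of the table can be seen in a few lines. This is a sketch with scalar values and made-up scores; `hard_lookup` and `soft_lookup` are illustrative names. Nudging a non-winning score leaves the hard result untouched, so there is no signal to learn from, while the soft result shifts slightly.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def hard_lookup(scores, values):
    """Argmax: return only the value with the highest score."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    return values[best]

def soft_lookup(scores, values):
    """Softmax-weighted blend of all values."""
    return sum(w * v for w, v in zip(softmax(scores), values))

values = [10.0, 20.0, 30.0]
scores = [2.0, 1.0, 0.0]
nudged = [2.0, 1.01, 0.0]  # second score raised slightly

hard_before, hard_after = hard_lookup(scores, values), hard_lookup(nudged, values)
soft_before, soft_after = soft_lookup(scores, values), soft_lookup(nudged, values)

print(hard_before, hard_after)  # identical: the nudge is invisible to argmax
print(soft_before, soft_after)  # the blend moves a little towards 20.0

# Multiplying the scores by a large factor makes the soft lookup behave like the hard one.
sharp = soft_lookup([100.0 * s for s in scores], values)
print(sharp)  # very close to 10.0
```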

Summary

Symbol      | What it is     | What it does
Q           | Query          | What this token is looking for
K           | Key            | What each token advertises it contains
V           | Value          | What each token contributes when selected
QKᵀ         | Similarity     | How well query matches each key
Softmax     | Normalisation  | Converts similarities to weights
Weighted V  | Output         | Blend of values, weighted by attention
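Read top to bottom, the rows of this table compose into the usual matrix form of scaled dot-product attention, which handles every query token at once. The shape symbols n (sequence length), d_model (width of X) and d_v (value dimension) are notation introduced here for illustration; only d_k appears earlier in this post. The softmax is applied to each row separately, so each query gets its own set of weights.

```latex
\[
\mathrm{Attention}(Q, K, V)
  = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V,
\qquad
Q = X W_Q,\quad K = X W_K,\quad V = X W_V
\]
\[
X \in \mathbb{R}^{n \times d_{\text{model}}},\quad
W_Q, W_K \in \mathbb{R}^{d_{\text{model}} \times d_k},\quad
W_V \in \mathbb{R}^{d_{\text{model}} \times d_v},\quad
\mathrm{Attention}(Q, K, V) \in \mathbb{R}^{n \times d_v}
\]
```

Row i of the result is the single-query output from the step-by-step section, with q taken to be row i of Q.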

QKV is a learned, differentiable, soft database lookup. Once you see it this way, the rest of the Transformer follows naturally.