ML//Transformer//attention//Q, K, V

2026-03-07

Three roles every token plays simultaneously in attention:

Three roles every token plays simultaneously in attention:

Query (Q): "what am I looking for?" Each token broadcasts a 128-dim question via W_Q.

Key (K): "what do I have to offer?" Each token advertises its relevance via W_K.

Value (V): "what information do I carry?", the actual content that flows if attention fires, via W_V.

Q·K (dot product) measures alignment between what one token seeks and what another offers → attention scores

Why can't V just be the original embedding? Because W_V learns to extract only the subespace relevant for this head: "banco" attending to "río" should receive river-related features, not all of "río".

Each head has its own Q, K, V: different learned projections, different questions asked.