ML//Transformer//attention//Q, K, V

Three roles every token plays simultaneously in attention:


Three roles every token plays simultaneously in attention:

Query (Q): "what am I looking for?" — each token broadcasts a 128-dim question via W_Q.

Key (K): "what do I have to offer?" — each token advertises its relevance via W_K.

Value (V): "what information do I carry?" — the actual content that flows if attention fires, via W_V.

Q·K (dot product) measures alignment between what one token seeks and what another offers → attention scores

Why can't V just be the original embedding? Because W_V learns to extract only the subespace relevant for this head — "banco" attending to "río" should receive river-related features, not all of "río".

Each head has its own Q, K, V — different learned projections, different questions asked.