ML//Transformer//attention//Q, K, V
Three roles every token plays simultaneously in attention:
Three roles every token plays simultaneously in attention:
Query (Q): "what am I looking for?" — each token broadcasts a 128-dim question via W_Q.
Key (K): "what do I have to offer?" — each token advertises its relevance via W_K.
Value (V): "what information do I carry?" — the actual content that flows if attention fires, via W_V.
Q·K (dot product) measures alignment between what one token seeks and what another offers → attention scores
Why can't V just be the original embedding? Because W_V learns to extract only the subespace relevant for this head — "banco" attending to "río" should receive river-related features, not all of "río".
Each head has its own Q, K, V — different learned projections, different questions asked.