ML//Transformer//attention//cross-attention
Q from one sequence, K and V from another — how the decoder "looks at" the encoder output.
No causal masking — the decoder can see ALL encoder tokens freely. Makes sense: when translating, you should see the whole source sentence.
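A minimal NumPy sketch of single-head cross-attention, to make the "Q from one side, K and V from the other" split concrete. The names (`cross_attention`, `dec_states`, `enc_states`, `W_q`, `W_k`, `W_v`) and the toy sizes are assumptions for illustration, not taken from any particular library.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(dec_states, enc_states, W_q, W_k, W_v):
    """Single-head cross-attention (hypothetical helper).

    dec_states: (T_dec, d_model) decoder hidden states -> queries
    enc_states: (T_enc, d_model) encoder output        -> keys and values
    """
    Q = dec_states @ W_q                       # (T_dec, d_k)
    K = enc_states @ W_k                       # (T_enc, d_k)
    V = enc_states @ W_v                       # (T_enc, d_k)
    scores = Q @ K.T / np.sqrt(Q.shape[-1])    # (T_dec, T_enc)
    # No causal mask: every decoder position scores every encoder position.
    weights = softmax(scores, axis=-1)
    return weights @ V, weights

rng = np.random.default_rng(0)
d_model, d_k = 8, 4
enc_states = rng.normal(size=(5, d_model))     # 5 source tokens
dec_states = rng.normal(size=(3, d_model))     # 3 target positions
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))

out, weights = cross_attention(dec_states, enc_states, W_q, W_k, W_v)
```

Each row of `weights` is a distribution over all 5 encoder tokens, which is the "sees the whole source sentence" property.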
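At the first decoding step, Q comes from [BOS]: the cold start. "How do I begin translating when I haven't generated anything yet?" The [BOS] embedding, after the decoder's masked self-attention, is projected into the initial query. At later steps, Q comes from the tokens generated so far.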
K and V from the encoder are computed once and cached — they don't change as the decoder generates token by token.
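A rough sketch of that caching idea, assuming a single head and a toy decoding loop. `CrossAttentionCache` is a hypothetical helper, and the random `dec_states` rows stand in for real decoder hidden states at each step.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

class CrossAttentionCache:
    """Projects the encoder output to K and V once, then reuses them."""

    def __init__(self, enc_states, W_k, W_v):
        self.K = enc_states @ W_k   # (T_enc, d_k), computed once
        self.V = enc_states @ W_v   # (T_enc, d_k), computed once

    def attend(self, q):
        # q: (d_k,) query for the current decoding step only.
        scores = q @ self.K.T / np.sqrt(q.shape[-1])   # (T_enc,)
        return softmax(scores) @ self.V                # (d_k,)

rng = np.random.default_rng(0)
d_model, d_k = 8, 4
enc_states = rng.normal(size=(5, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))

cache = CrossAttentionCache(enc_states, W_k, W_v)

# Stand-in decoder states, one per generated token (assumption for the demo).
dec_states = rng.normal(size=(3, d_model))
outputs = []
for dec_state in dec_states:
    q = dec_state @ W_q             # only the query is new at each step
    outputs.append(cache.attend(q))
```

Only `q` is recomputed per step; `cache.K` and `cache.V` depend solely on the encoder output, so they stay fixed for the whole generation.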
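Also how Stable Diffusion conditions image generation on text: Q from the image latents in the U-Net, K and V from the text encoder output. (The original DALL-E works differently: text and image tokens go through one autoregressive transformer as a single sequence, so the conditioning there is self-attention.)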
Self-attention in decoder: causal masking ✓. Cross-attention (dec→enc): no causal masking ✓. Self-attention in encoder: no causal masking ✓. (Padding masks can still apply to all three.)
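The three masking cases can be sketched as boolean masks, where `True` means "may attend". The sizes and the `apply_mask` helper are assumptions for illustration; padding masks are left out to keep the sketch short.

```python
import numpy as np

T_enc, T_dec = 5, 3

# Decoder self-attention: position i may attend only to positions <= i.
causal_mask = np.tril(np.ones((T_dec, T_dec), dtype=bool))

# Cross-attention: every decoder position may attend to every encoder position.
cross_mask = np.ones((T_dec, T_enc), dtype=bool)

# Encoder self-attention: every position may attend to every other position.
encoder_mask = np.ones((T_enc, T_enc), dtype=bool)

def apply_mask(scores, mask):
    # Blocked positions get -inf so softmax assigns them zero weight.
    return np.where(mask, scores, -np.inf)

scores = np.zeros((T_dec, T_dec))
masked = apply_mask(scores, causal_mask)
```

Only `causal_mask` has blocked entries (the upper triangle); the other two are all `True`.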