ML//Transformer//attention//cross-attention

2020-03-15

Q from one sequence, K and V from another: how the decoder "looks at" the encoder output.

No causal masking: the decoder can see ALL encoder tokens freely. Makes sense: when translating, you should see the whole source sentence.
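A minimal single-head sketch of the above, in NumPy. The names (`cross_attention`, `dec_h`, `enc_out`, `W_q`, `W_k`, `W_v`), the sizes, and the random weights are made up for illustration; a real model adds multiple heads, an output projection, and learned weights.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(dec_h, enc_out, W_q, W_k, W_v):
    Q = dec_h @ W_q        # queries from the decoder side: (T_dec, d_k)
    K = enc_out @ W_k      # keys from the encoder output:  (T_enc, d_k)
    V = enc_out @ W_v      # values from the encoder output: (T_enc, d_k)
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # (T_dec, T_enc)
    weights = softmax(scores, axis=-1)        # no causal mask applied
    return weights @ V, weights

# Hypothetical toy sizes and random weights (assumptions, not from a real model).
rng = np.random.default_rng(0)
d_model, d_k = 8, 4
enc_out = rng.normal(size=(5, d_model))   # 5 source tokens
dec_h = rng.normal(size=(3, d_model))     # 3 target positions
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) / np.sqrt(d_model)
                 for _ in range(3))

out, weights = cross_attention(dec_h, enc_out, W_q, W_k, W_v)
```

Each row of `weights` is a distribution over all 5 source tokens, which is the "sees the whole source sentence" part.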

The first Q comes from [BOS], the cold start. "How do I begin translating when I haven't generated anything yet?" The [BOS] embedding, processed by the decoder's masked self-attention and then projected by the cross-attention query weights, IS the initial query.

K and V from the encoder are computed once and cached (once per decoder layer, since each layer has its own K/V projections): they don't change as the decoder generates token by token.
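A rough sketch of the caching idea, assuming a single head and a single decoder layer. `K_cache`, `V_cache`, `cross_attend_step`, and `dec_states` are hypothetical names, and the decoder hidden states are random stand-ins rather than the output of real decoder layers.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(1)
d_model, d_k = 8, 4
enc_out = rng.normal(size=(6, d_model))   # encoder output, fixed for the whole decode
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) / np.sqrt(d_model)
                 for _ in range(3))

# Computed once, before decoding starts.
K_cache = enc_out @ W_k   # (T_enc, d_k)
V_cache = enc_out @ W_v   # (T_enc, d_k)

def cross_attend_step(dec_h_t):
    # One decoding step: only the query is new, K and V are reused.
    q = dec_h_t @ W_q                               # (d_k,)
    weights = softmax(K_cache @ q / np.sqrt(d_k))   # (T_enc,)
    return weights @ V_cache                        # (d_k,)

# Stand-in decoder hidden states, one per generated token (assumption).
dec_states = rng.normal(size=(4, d_model))
step_outputs = np.stack([cross_attend_step(h) for h in dec_states])

# Reference: recompute K and V from scratch for all steps at once.
full = softmax((dec_states @ W_q) @ (enc_out @ W_k).T / np.sqrt(d_k)) @ (enc_out @ W_v)
```

`step_outputs` and `full` agree, so recomputing the encoder K and V at every step would be wasted work.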

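Also how Stable Diffusion conditions image generation on text: Q from the image latents, K and V from the text encoder output. (The original DALL-E works differently: text and image tokens go into one sequence with self-attention.)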

Self-attention in decoder: causal masking ✓. Cross-attention (dec→enc): no causal masking ✓. Self-attention in encoder: no causal masking ✓. (Padding masks can still apply in all three.)
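The three cases can be sketched as boolean masks (True = may attend). The sizes, the variable names, and the `masked_softmax` helper are illustrative assumptions; padding is ignored here, and scores are all zeros so the weights show the mask shape only.

```python
import numpy as np

T_enc, T_dec = 4, 3

# Decoder self-attention: position i sees positions 0..i only.
dec_self_mask = np.tril(np.ones((T_dec, T_dec), dtype=bool))
# Cross-attention: every decoder position sees every encoder token.
cross_mask = np.ones((T_dec, T_enc), dtype=bool)
# Encoder self-attention: every token sees every token.
enc_self_mask = np.ones((T_enc, T_enc), dtype=bool)

def masked_softmax(scores, mask):
    # Blocked positions get -inf, so they receive zero weight.
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(scores)
    return e / e.sum(axis=-1, keepdims=True)

dec_self_w = masked_softmax(np.zeros((T_dec, T_dec)), dec_self_mask)
cross_w = masked_softmax(np.zeros((T_dec, T_enc)), cross_mask)
```

With uniform scores, `dec_self_w` is lower-triangular (the first row puts all weight on position 0), while every row of `cross_w` spreads evenly over all encoder tokens.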