ML//Transformer//decoder
GPT-style architecture: self-attention with causal masking. Each token attends only to itself and the tokens before it, never to later positions.
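A minimal sketch of causal masking, in plain Python with toy numbers. The function names and the uniform score matrix are made up for illustration. Real implementations usually set the masked scores to -inf before the softmax; taking the softmax over only the visible prefix, as done here, gives the same weights.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention_weights(scores):
    """Turn an n x n matrix of raw attention scores into causal weights."""
    n = len(scores)
    weights = []
    for i in range(n):
        # Position i may attend to positions 0..i; later positions get weight 0.
        visible = softmax(scores[i][: i + 1])
        weights.append(visible + [0.0] * (n - i - 1))
    return weights

scores = [[0.0] * 4 for _ in range(4)]  # uniform toy scores (assumption)
weights = causal_attention_weights(scores)
for row in weights:
    print([round(w, 2) for w in row])
```

The printed matrix is lower-triangular: the first token can only attend to itself, and the last token spreads its weight over all four positions.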
GPT is decoder-only: no encoder, no cross-attention. If it translates, it learned translation as a pattern during pre-training, not from a dedicated encoder.
In encoder-decoder models (T5, translation): decoder has both causal self-attention AND cross-attention to the encoder output.
Cross-attention in the decoder: Q comes from the decoder's own states (the target sequence being generated), K and V come from the encoder output (the source sequence). Generation starts from [BOS] as the first decoder input, since no target tokens exist yet.
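A rough sketch of where Q, K and V come from in cross-attention. It assumes a single head and skips the learned projection matrices (the states are used directly as Q, K, V), so it only shows the data flow, not a real layer. The vectors are toy values.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(decoder_states, encoder_output):
    """Q from decoder states; K and V from encoder output (projections omitted)."""
    d = len(encoder_output[0])
    out = []
    for q in decoder_states:
        # One score per source token; no causal mask, the whole source is visible.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in encoder_output]
        w = softmax(scores)
        # Weighted sum of the encoder vectors (used as V here).
        out.append([sum(wj * v[c] for wj, v in zip(w, encoder_output))
                    for c in range(d)])
    return out

encoder_output = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 source tokens (toy)
decoder_states = [[0.0, 0.0]]  # 1 target position so far, standing in for [BOS]
out = cross_attention(decoder_states, encoder_output)
print(len(out), len(out[0]))
```

The output has one row per target position, and each row is a mix of source vectors. The causal mask applies only to the decoder's self-attention, not here.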
Causal masking enables the KV cache: previous tokens are "closed". Their keys and values never depend on later tokens, so they don't change when a new token arrives and can be stored and reused instead of recomputed.
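A small check of the KV-cache idea, again as a single-head toy with the projections left out (each input vector serves as its own Q, K and V). The `KVCache` class and the helper names are hypothetical and not taken from any library. The sketch compares recomputing causal attention from scratch against appending one key/value per step.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(q, keys, values):
    # Scaled dot-product attention for a single query vector.
    d = len(q)
    w = softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys])
    return [sum(wj * v[c] for wj, v in zip(w, values))
            for c in range(len(values[0]))]

def full_causal(xs):
    # Recompute every position from scratch; position i sees xs[: i + 1].
    return [attend(xs[i], xs[: i + 1], xs[: i + 1]) for i in range(len(xs))]

class KVCache:
    """Hypothetical cache: stores keys and values of tokens seen so far."""
    def __init__(self):
        self.keys = []
        self.values = []

    def step(self, x):
        # Only the new token's key and value are added; old entries are untouched.
        self.keys.append(x)
        self.values.append(x)
        return attend(x, self.keys, self.values)

xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy token vectors
cache = KVCache()
incremental = [cache.step(x) for x in xs]
full = full_causal(xs)
print(incremental == full)
```

Both paths give the same outputs, which is why generation only needs to run the newest token through attention at each step. Without the causal mask, earlier outputs would depend on later tokens and the cached entries would go stale.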