ML//Transformer//decoder

GPT-style architecture: self-attention with causal masking, so each token sees only itself and the tokens before it, never later ones.
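A minimal NumPy sketch of causal masking, assuming a single head, no batching, and random weights. The name `causal_self_attention` and the shapes are illustrative, not taken from any library:

```python
import numpy as np

def causal_self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention with a causal mask (illustrative sketch).

    x: (seq_len, d_model); w_q, w_k, w_v: (d_model, d_head).
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])            # (seq_len, seq_len)
    seq_len = x.shape[0]
    # True strictly above the diagonal = future positions to hide.
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    # Row-wise softmax; masked entries get weight exactly 0.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                            # 4 tokens, d_model = 8
w_q, w_k, w_v = (rng.normal(size=(8, 8)) for _ in range(3))
out, weights = causal_self_attention(x, w_q, w_k, w_v)
```

Row i of `weights` is zero for every column j > i, so changing a later token leaves the outputs of earlier positions untouched.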


GPT is decoder-only: no encoder, no cross-attention. If it translates, it learned translation as a pattern during pre-training, not from a dedicated encoder.

In encoder-decoder models (T5, translation models): the decoder has both causal self-attention AND cross-attention to the encoder output.

Cross-attention in the decoder: Q comes from the target side (what we're generating), K and V from the source (encoder output). Generation starts from [BOS] alone as the cold start, since no target tokens exist yet.
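A rough NumPy sketch of where Q, K, and V come from in cross-attention, again assuming a single head and random weights. `cross_attention`, `enc_out`, and `dec_states` are hypothetical names chosen for illustration:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_attention(dec_states, enc_out, w_q, w_k, w_v):
    """dec_states: (tgt_len, d_model); enc_out: (src_len, d_model)."""
    q = dec_states @ w_q          # queries from the target side
    k = enc_out @ w_k             # keys from the encoder output
    v = enc_out @ w_v             # values from the encoder output
    # No causal mask here: the whole source sentence is visible.
    weights = softmax(q @ k.T / np.sqrt(k.shape[-1]))   # (tgt_len, src_len)
    return weights @ v, weights

rng = np.random.default_rng(0)
enc_out = rng.normal(size=(5, 8))       # 5 source tokens
dec_states = rng.normal(size=(1, 8))    # only [BOS] so far (cold start)
w_q, w_k, w_v = (rng.normal(size=(8, 8)) for _ in range(3))
ctx, weights = cross_attention(dec_states, enc_out, w_q, w_k, w_v)
```

The attention matrix has one row per target token and one column per source token, so with only [BOS] on the target side it is a single row over the five source positions.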

Causal masking enables the KV cache: previous tokens are "closed". Their representations never depend on later tokens, so their keys and values can be computed once and reused when a new token arrives.
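A small NumPy check of that claim, assuming one attention layer and one head. `decode_step` and the dict-based `cache` are illustrative, not a real library API. Token-by-token decoding with cached K/V gives the same outputs as recomputing masked attention over the full sequence:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def full_causal_attention(x, w_q, w_k, w_v):
    """Recompute everything for the whole sequence with a causal mask."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    future = np.triu(np.ones(scores.shape, dtype=bool), k=1)
    return softmax(np.where(future, -np.inf, scores)) @ v

def decode_step(x_new, cache, w_q, w_k, w_v):
    """Process one new token; only its own K and V are computed."""
    cache["k"].append(x_new @ w_k)
    cache["v"].append(x_new @ w_v)
    k = np.stack(cache["k"])          # (tokens_so_far, d_head)
    v = np.stack(cache["v"])
    q = x_new @ w_q                   # (d_head,)
    weights = softmax(q @ k.T / np.sqrt(k.shape[-1]))
    return weights @ v

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))
w_q, w_k, w_v = (rng.normal(size=(8, 8)) for _ in range(3))

cache = {"k": [], "v": []}
incremental = np.stack([decode_step(t, cache, w_q, w_k, w_v) for t in x])
full = full_causal_attention(x, w_q, w_k, w_v)
print(np.allclose(incremental, full))
```

In a multi-layer model the same argument applies at every layer: masking keeps each layer's hidden state for an earlier token independent of later tokens, so each layer can keep its own cache.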