ML//Transformer//decoder

2026-03-02

GPT-style architecture: self-attention with causal masking, so each token attends only to itself and the tokens before it, never to later ones.
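
A minimal sketch of the causal mask, assuming raw attention scores are already computed (the `scores` matrix and the function names here are made up for illustration, not taken from any library). Future positions are set to negative infinity before the softmax, so they end up with exactly zero weight:

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention_weights(scores):
    """Row i becomes a distribution over positions 0..i only."""
    n = len(scores)
    weights = []
    for i in range(n):
        # Positions j > i are in the future: mask them with -inf.
        masked = [scores[i][j] if j <= i else -math.inf for j in range(n)]
        weights.append(softmax(masked))
    return weights

# Hypothetical raw scores for a 3-token sequence.
scores = [[0.0, 1.0, 2.0],
          [1.0, 0.0, 3.0],
          [2.0, 1.0, 0.0]]
weights = causal_attention_weights(scores)
```

The first token can only attend to itself, so its row puts all weight on position 0. The last token sees the whole sequence.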

GPT is decoder-only: no encoder, no cross-attention. If it translates, it learned translation as a pattern during pre-training, not from a dedicated encoder.

In encoder-decoder models (e.g. T5, classic translation models): the decoder has both causal self-attention AND cross-attention to the encoder output.

Cross-attention in decoder: Q comes from the decoder's own states (the target sequence being generated), K and V come from the encoder output (the source). Generation starts from [BOS], since no target tokens exist yet.
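
A rough single-head sketch of that Q/K/V split. It assumes the learned projections W_q, W_k, W_v are identity to keep it short, and the vectors are toy values, not real model states. Queries come from the decoder side, keys and values from the encoder output, and there is no causal mask because the whole source is visible:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross_attention(decoder_states, encoder_outputs):
    """Q from decoder states; K and V from encoder outputs.

    Assumption: projections are omitted (treated as identity).
    """
    d = len(encoder_outputs[0])
    outputs, all_weights = [], []
    for q in decoder_states:
        # Score this target position against every source position.
        scores = [dot(q, k) / math.sqrt(d) for k in encoder_outputs]
        w = softmax(scores)
        # Weighted sum of the encoder outputs (the values).
        out = [sum(wj * v[c] for wj, v in zip(w, encoder_outputs))
               for c in range(d)]
        outputs.append(out)
        all_weights.append(w)
    return outputs, all_weights

encoder_outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 source tokens
decoder_states = [[0.5, 0.5]]  # only the [BOS] position exists so far
outputs, weights = cross_attention(decoder_states, encoder_outputs)
```

The output has one row per target position, and each row of `weights` spans all source positions.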

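Masking enables the KV cache: previous tokens are "closed". Their representations, and therefore their keys and values, don't depend on later tokens, so they don't change when a new token arrives and can be stored and reused instead of recomputed.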
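
A small check of the "closed" property, as a sketch under simplifying assumptions: one attention layer, one head, identity projections, toy vectors. With the causal mask, appending a token leaves the outputs of earlier positions untouched. Without it, they change, which is why a cache would go stale:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def self_attention(xs, causal):
    """Single-head self-attention with identity projections (assumption)."""
    d = len(xs[0])
    out = []
    for i, q in enumerate(xs):
        # Causal: position i sees 0..i. Otherwise it sees everything.
        visible = xs[:i + 1] if causal else xs
        w = softmax([dot(q, k) / math.sqrt(d) for k in visible])
        out.append([sum(wj * v[c] for wj, v in zip(w, visible))
                    for c in range(d)])
    return out

prefix = [[1.0, 0.0], [0.0, 1.0]]
full = prefix + [[2.0, 2.0]]  # one new token arrives

# Compare outputs for the first two positions before and after the new token.
unchanged_with_mask = (
    self_attention(full, causal=True)[:2] == self_attention(prefix, causal=True)
)
unchanged_without_mask = (
    self_attention(full, causal=False)[:2] == self_attention(prefix, causal=False)
)
```

Here `unchanged_with_mask` is `True` and `unchanged_without_mask` is `False`. In a multi-layer model the same argument applies layer by layer, so the cached keys and values at every layer stay valid.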