ML//Transformer//encoder-decoder
The original Transformer architecture: encoder processes input bidirectionally, decoder generates output autoregressively.
Encoder: self-attention without masking (all tokens see all). Decoder: self-attention with causal masking + cross-attention to encoder.
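The masking difference can be sketched in a few lines of plain Python. This is a minimal illustration, not a real attention layer: the function names (`causal_mask`, `full_mask`, `attention_weights`) are made up for this note, and the scores are set to uniform zeros so that only the effect of the mask shows up in the weights.

```python
import math

def causal_mask(n):
    """mask[i][j] is True when query position i may attend to key position j."""
    return [[j <= i for j in range(n)] for i in range(n)]

def full_mask(n_queries, n_keys):
    """No masking: every query sees every key (encoder self-attention, cross-attention)."""
    return [[True] * n_keys for _ in range(n_queries)]

def attention_weights(scores, mask):
    """Row-wise softmax over scores; masked-out positions get zero weight."""
    weights = []
    for row, allowed in zip(scores, mask):
        visible = [s for s, ok in zip(row, allowed) if ok]
        peak = max(visible)  # subtract the max for numerical stability
        exps = [math.exp(s - peak) if ok else 0.0 for s, ok in zip(row, allowed)]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Uniform scores (assumption for illustration) make the mask's effect easy to read.
scores = [[0.0] * 3 for _ in range(3)]
encoder_weights = attention_weights(scores, full_mask(3, 3))  # every row spreads over all 3 tokens
decoder_weights = attention_weights(scores, causal_mask(3))   # row i spreads over tokens 0..i only
```

With the causal mask, the first decoder position can only attend to itself, the second splits its weight over the first two positions, and only the last position sees everything. Cross-attention uses the unmasked form, with decoder positions as queries and encoder positions as keys.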
T5 kept this structure. GPT dropped the encoder (decoder-only). BERT dropped the decoder (encoder-only).
This split shaped the field: decoder-only models for generation, encoder-only models for understanding.
Translation example: encoder processes "El gato duerme" (full bidirectional understanding), decoder generates "The cat sleeps" autoregressively, attending to the encoder via cross-attention at each step.
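The control flow of that example can be sketched as a greedy decoding loop. Everything here is a toy stand-in: `TOY_LEXICON`, `encode`, `decode_step`, and `greedy_translate` are hypothetical names, and a dictionary lookup replaces the learned layers and the cross-attention itself. The word-by-word alignment is an artifact of the toy, not a property of real translation models. The only point is the shape of the loop: the encoder runs once over the full source, and the decoder runs once per generated token, consulting the encoder output each time.

```python
# Toy stand-in for a trained model: a real system uses learned layers, not a dictionary.
TOY_LEXICON = {"El": "The", "gato": "cat", "duerme": "sleeps"}

def encode(source_tokens):
    """Run once over the whole source; the result is available to every decoder step."""
    return list(source_tokens)  # hypothetical "encoder states"

def decode_step(memory, prefix):
    """Pick the next target token from the encoder memory and the tokens generated so far."""
    position = len(prefix) - 1  # prefix always starts with <bos>
    if position >= len(memory):
        return "<eos>"
    return TOY_LEXICON[memory[position]]

def greedy_translate(source_tokens, max_len=10):
    memory = encode(source_tokens)  # encoder: a single pass
    output = ["<bos>"]
    for _ in range(max_len):        # decoder: one step per generated token
        token = decode_step(memory, output)
        if token == "<eos>":
            break
        output.append(token)
    return output[1:]               # drop <bos>

print(greedy_translate(["El", "gato", "duerme"]))  # prints ['The', 'cat', 'sleeps']
```

Each call to `decode_step` receives the whole `memory` but only the target prefix produced so far, which mirrors full visibility of the source combined with causal visibility of the output.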