ML//Transformer//encoder-decoder

The original Transformer architecture: encoder processes input bidirectionally, decoder generates output autoregressively.


Encoder: self-attention without masking (all tokens see all). Decoder: self-attention with causal masking + cross-attention to encoder.
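The three attention patterns above can be sketched as boolean masks, where True means "may attend". This is a minimal pure-Python illustration, not any library's API; the function names (encoder_mask, causal_mask, cross_mask, masked_softmax) are made up for this note.

```python
import math

def encoder_mask(n):
    """Encoder self-attention: every position may attend to every position."""
    return [[True] * n for _ in range(n)]

def causal_mask(n):
    """Decoder self-attention: position i may attend only to positions j <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]

def cross_mask(tgt_len, src_len):
    """Cross-attention: every decoder position may attend to every encoder position."""
    return [[True] * src_len for _ in range(tgt_len)]

def masked_softmax(scores, allowed):
    """Softmax over one row of scores; disallowed positions get weight 0.

    Assumes at least one position is allowed (always true for a causal row,
    since a token can attend to itself).
    """
    masked = [s if ok else float("-inf") for s, ok in zip(scores, allowed)]
    peak = max(masked)
    exps = [math.exp(s - peak) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]

def show(mask):
    """Render a mask as rows of 1 (visible) and . (hidden)."""
    return "\n".join(" ".join("1" if ok else "." for ok in row) for row in mask)

print(show(causal_mask(3)))
# Second decoder token with equal scores: weight is split over tokens 0 and 1 only.
print(masked_softmax([1.0, 1.0, 1.0], causal_mask(3)[1]))
```

The causal mask is lower-triangular, which is what keeps the decoder from seeing tokens it has not generated yet; the encoder and cross-attention masks are fully open.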

T5 kept this structure. GPT dropped the encoder (decoder-only). BERT dropped the decoder (encoder-only).

This split defined the field's two main lines: decoder-only models for generation, encoder-only models for understanding.

Translation example: encoder processes "El gato duerme" (full bidirectional understanding), decoder generates "The cat sleeps" autoregressively, attending to the encoder via cross-attention at each step.
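The control flow of that example can be sketched as follows. This shows only the loop structure (encoder runs once, decoder runs once per output token and consults the encoder output each step). The lookup table TOY_LEXICON and the functions encode, decode_step, and translate are hypothetical stand-ins for learned layers, and the word-by-word alignment is a simplification that real translation does not follow.

```python
# Toy stand-ins: a real model would use learned layers, not a lookup table.
TOY_LEXICON = {"El": "The", "gato": "cat", "duerme": "sleeps"}  # hypothetical
EOS = "<eos>"

def encode(source_tokens):
    """Run once over the whole source; every token is visible at once."""
    return list(source_tokens)  # stand-in for per-token encoder states

def decode_step(encoder_states, generated):
    """Predict the next token from the encoder states and the prefix so far."""
    position = len(generated)
    if position >= len(encoder_states):
        return EOS
    # Stand-in for cross-attention: look up the aligned source token.
    return TOY_LEXICON[encoder_states[position]]

def translate(source_tokens, max_len=10):
    encoder_states = encode(source_tokens)  # encoder runs once
    generated = []
    for _ in range(max_len):  # decoder runs once per output token
        token = decode_step(encoder_states, generated)
        if token == EOS:
            break
        generated.append(token)  # the new token becomes part of the prefix
    return generated

print(" ".join(translate(["El", "gato", "duerme"])))  # prints "The cat sleeps"
```

Each call to decode_step sees only the tokens generated so far plus the full encoder output, which mirrors causal self-attention combined with cross-attention.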