ML//Transformer//encoder-decoder//T5

Text-to-Text Transfer Transformer (Google, 2019): treats every NLP task as a text-to-text problem. Translation? "translate English to French: ..." produces French text. Summarization? "summarize: ..." produces a summary. Classification? "classify: ..." produces the label, written out as text.
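A minimal sketch of the casting step described above: every task becomes a (source, target) pair of plain strings, with the task named in a prefix. The `Example` class and `to_text_to_text` helper are hypothetical names made up for this note, not part of any T5 library, and the sentences are invented. The "classify" prefix follows the wording used here; the prefix strings in the released checkpoints differ per dataset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Example:
    source: str  # string the encoder reads (prefix + input text)
    target: str  # string the decoder is trained to emit


def to_text_to_text(prefix: str, text: str, target: str) -> Example:
    """Cast a task instance into a pair of strings by prepending a task prefix."""
    return Example(source=f"{prefix}: {text}", target=target)


# Illustrative data only; the sentences and labels are made up.
examples = [
    to_text_to_text("translate English to French", "The house is small.", "La maison est petite."),
    to_text_to_text("summarize", "<long article text>", "<short summary>"),
    to_text_to_text("classify", "A delightful film.", "positive"),
]

for ex in examples:
    print(f"{ex.source!r} -> {ex.target!r}")
```

Note that the classification target is the word "positive", not a class index: the label is generated as text like any other output.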


The key insight: a unified text-to-text format means one architecture, one loss function (cross-entropy over the output tokens), and one training procedure for all tasks. No task-specific heads or output layers. The model learns which task to perform from the prefix alone.
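To make the "one loss function" point concrete, here is a rough sketch of the shared objective: the mean negative log-likelihood of the reference output tokens. The function name `sequence_nll` and the per-token probabilities are assumptions for illustration; a real model would produce those probabilities from its decoder.

```python
import math


def sequence_nll(token_probs):
    """Mean negative log-likelihood of the reference target tokens.

    token_probs[i] is the probability the model assigned to the i-th
    correct target token, given the input and the earlier target tokens.
    """
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


# Hypothetical probabilities for targets from three different tasks.
batch = {
    "translation": [0.9, 0.8, 0.95, 0.7],
    "summarization": [0.6, 0.85, 0.9],
    "classification": [0.97],  # a one-token label such as "positive"
}

# The same function scores every task; only the target strings differ.
losses = {task: sequence_nll(probs) for task, probs in batch.items()}
for task, loss in losses.items():
    print(f"{task}: {loss:.4f}")
```

Because a label is just a short target sequence, classification needs no softmax head of its own.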

The canonical encoder-decoder model: the encoder processes the full input with bidirectional attention (like BERT), and the decoder generates the output autoregressively (like GPT) while attending to the encoder's output through cross-attention. This split fits tasks where the input and the output are distinct sequences.
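The difference between the two attention patterns can be sketched as boolean masks, where `True` means "row position may attend to column position". This is a simplified illustration in plain Python lists: it ignores padding and batching, the function names are hypothetical, and the cross-attention mask is included as standard encoder-decoder behavior rather than something specific to this note.

```python
def bidirectional_mask(n):
    """Encoder self-attention: every position may attend to every position."""
    return [[True] * n for _ in range(n)]


def causal_mask(n):
    """Decoder self-attention: position i may attend only to positions 0..i."""
    return [[j <= i for j in range(n)] for i in range(n)]


def cross_mask(n_target, n_source):
    """Decoder-to-encoder attention: each target position sees the whole input."""
    return [[True] * n_source for _ in range(n_target)]


def show(mask):
    """Render a mask as text, 'x' for allowed and '.' for blocked."""
    return "\n".join("".join("x" if ok else "." for ok in row) for row in mask)


print("encoder (bidirectional):")
print(show(bidirectional_mask(3)))
print("decoder (causal):")
print(show(causal_mask(3)))
print("cross-attention (2 target x 3 source):")
print(show(cross_mask(2, 3)))
```

The lower-triangular causal mask is what keeps generation autoregressive: a target token cannot see the tokens that come after it.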

Pre-trained on C4 (Colossal Clean Crawled Corpus), a cleaned version of Common Crawl. The T5 paper doubled as a large empirical study comparing pre-training objectives, architectures, and transfer strategies, and many of its conclusions became standard practice.

Compared to decoder-only (GPT-style): encoder-decoder tends to fit better when the input is a fixed, known text and the output is a transformation of it (translation, summarization). Decoder-only tends to fit better for open-ended generation where input and output blur together (conversation, creative writing).

Spawned the Flan-T5 family (supervised fine-tuning, or SFT, on 1800+ tasks), which showed that instruction tuning on diverse tasks substantially improves zero-shot and few-shot performance. This was an early signal that SFT could unlock capabilities that pre-training alone couldn't.