ML//model//GPT
Decoder-only. No encoder, no cross-attention. If it translates, it learned translation as a pattern during pre-training, not from a dedicated encoder.
Trained with causal language modeling: predict next token using only previous tokens.
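A minimal sketch of what "only previous tokens" means in practice, in plain Python. The function names (causal_mask, masked_softmax, shift_for_next_token, cross_entropy) are made up for this note and are not from any GPT codebase. It shows the lower-triangular mask, how masked positions get zero attention weight, and how targets are the inputs shifted by one.

```python
import math

def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def masked_softmax(scores, allowed):
    """Softmax over one row of attention scores; disallowed positions get weight 0."""
    # Subtract the max allowed score for numerical stability.
    m = max(s for s, ok in zip(scores, allowed) if ok)
    exps = [math.exp(s - m) if ok else 0.0 for s, ok in zip(scores, allowed)]
    total = sum(exps)
    return [e / total for e in exps]

def shift_for_next_token(token_ids):
    """Inputs are tokens 0..n-2, targets are tokens 1..n-1 (the 'next' token)."""
    return token_ids[:-1], token_ids[1:]

def cross_entropy(logits, target):
    """Negative log-probability of the target token under softmax(logits)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]

if __name__ == "__main__":
    mask = causal_mask(3)
    # Position 0 can only see itself, so all weight lands on index 0.
    print(masked_softmax([1.0, 2.0, 3.0], mask[0]))
    # Hypothetical token ids, for illustration only.
    print(shift_for_next_token([5, 6, 7, 8]))
```

The training loss is the mean of cross_entropy over all positions, so a single sequence gives one prediction target per token instead of one per sequence.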
Evolution
GPT-1 (2018): pre-training + SFT, where the SFT targeted specific tasks such as classification, not chatting.
GPT-2 (2019): pure pre-training. OpenAI wanted to see whether it could learn tasks from data volume alone, evaluated zero-shot with no task-specific fine-tuning.
GPT-3 (2020): pure pre-training at massive scale (175B parameters). Predicted text beautifully but didn't reliably follow instructions.
InstructGPT / GPT-3.5 (2022): the inflection point. Took GPT-3 and added SFT + RLHF; birth of the "assistant".
GPT-4 / GPT-4o: pre-training + SFT + RLHF + RLAIF. Broadly the same pipeline but scaled, using AI models to help score and correct outputs (full training details are not public).
Doesn't use [CLS]. For sequence-level tasks it simply takes the hidden state of the last token: under causal masking, that is the only position that has attended to the entire sequence.
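A rough sketch of last-token pooling, assuming right-padded batches and a 0/1 attention mask. Both are assumptions for this note, and last_token_index / pool_last_token are hypothetical helper names. The point is that with padding, "last token" means the last real token, not the last slot in the array.

```python
def last_token_index(attention_mask):
    """Index of the last real (non-padding) token, assuming right padding.

    attention_mask is a list of 1s (real tokens) followed by 0s (padding).
    """
    n_real = sum(attention_mask)
    if n_real == 0:
        raise ValueError("sequence contains no real tokens")
    return n_real - 1

def pool_last_token(hidden_states, attention_mask):
    """Pick the hidden vector at the last real position as the sequence summary."""
    return hidden_states[last_token_index(attention_mask)]

if __name__ == "__main__":
    # Toy hidden states: 4 positions, 2 dimensions each; the last slot is padding.
    hidden = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.0, 0.0]]
    mask = [1, 1, 1, 0]
    print(pool_last_token(hidden, mask))
```

A classification head would then be a linear layer applied to that one vector, playing the role [CLS] plays in encoder models.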