ML//model//GPT

Decoder-only. No encoder, no cross-attention. If it translates, it learned translation as a pattern during pre-training, not from a dedicated encoder.


Trained with causal language modeling: predict next token using only previous tokens.
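
A minimal sketch of what "only previous tokens" means in practice, in plain Python. The token ids and the helper names `causal_mask` and `next_token_pairs` are made up for illustration; real implementations build the mask as a tensor inside the attention layer and compute a cross-entropy loss over every position at once.

```python
def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]


def next_token_pairs(token_ids):
    """Build (context, target) training pairs: each target token is
    predicted from the tokens before it, never from later ones."""
    return [(token_ids[:i], token_ids[i]) for i in range(1, len(token_ids))]


tokens = [11, 42, 7, 99]  # hypothetical token ids
mask = causal_mask(len(tokens))
pairs = next_token_pairs(tokens)

for row in mask:
    # Print the lower-triangular pattern: 1 = visible, 0 = masked out.
    print(" ".join("1" if visible else "0" for visible in row))
for context, target in pairs:
    print(context, "->", target)
```

A sequence of n tokens yields n - 1 prediction targets, which is why one pass over a text gives a training signal at almost every position.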

Evolution

GPT-1 (2018): pre-training + supervised fine-tuning (SFT), but SFT for specific tasks like classification, not for chatting.

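GPT-2 (2019): pure pre-training, no task-specific fine-tuning; OpenAI tested whether scale alone was enough for the model to pick up tasks zero-shot.

GPT-3 (2020): pure pre-training at massive scale (175B parameters); generated fluent text and could learn from few-shot prompts, but did not reliably follow instructions.

InstructGPT / GPT-3.5 (2022): the inflection point; took GPT-3 and added SFT + RLHF, the birth of the "assistant".

GPT-4 (2023) / GPT-4o (2024): pre-training + SFT + RLHF, plus RLAIF-style model-generated feedback; same pipeline but scaled, with AI models helping to score and correct outputs. Full training details are not public.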

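Doesn't use [CLS]: for a sequence-level representation it simply takes the hidden state of the last token. With causal masking, the last position is the only one that attends to every earlier token, so it is the one that can summarize the full sequence.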
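
A rough sketch of last-token pooling in plain Python, assuming right-padded batches and an attention mask with 1 for real tokens and 0 for padding. The function name `last_token_pool` and the toy numbers are hypothetical, and this is not any particular library's API; it only shows that the index to read is the last real token, not the last slot in the padded sequence.

```python
def last_token_pool(hidden_states, attention_mask):
    """Return the hidden vector of the last real token of each sequence.

    hidden_states:  [batch][seq_len][hidden_dim] nested lists of floats
    attention_mask: [batch][seq_len], 1 for real tokens, 0 for right padding
    """
    pooled = []
    for states, mask in zip(hidden_states, attention_mask):
        n_real = sum(mask)
        if n_real == 0:
            raise ValueError("sequence has no real tokens")
        # Assumes right padding: real tokens occupy positions 0..n_real-1.
        pooled.append(states[n_real - 1])
    return pooled


# Toy batch: 2 sequences, length 3, hidden size 2 (made-up values).
hidden = [
    [[0.1, 0.2], [0.3, 0.4], [0.0, 0.0]],  # last slot is padding
    [[1.0, 1.1], [1.2, 1.3], [1.4, 1.5]],  # no padding
]
mask = [
    [1, 1, 0],
    [1, 1, 1],
]
pooled = last_token_pool(hidden, mask)
print(pooled)
```

With left padding the last slot is always a real token, so the lookup would simply be `states[-1]`; which side is padded is an assumption to check against the tokenizer in use.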