ML//model//GPT

2026-02-15

Decoder-only. No encoder, no cross-attention. If it translates, it learned translation as a pattern during pre-training, not from a dedicated encoder.

Trained with causal language modeling: predict next token using only previous tokens.
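A minimal NumPy sketch of the two pieces behind that sentence, not any particular GPT implementation: a lower-triangular mask so position t only sees positions up to t, and a cross-entropy loss where the logits at position t are scored against the token at t+1. The function names (`causal_mask`, `masked_attention_weights`, `next_token_loss`) and the single-sequence shapes are assumptions chosen for illustration.

```python
import numpy as np

def causal_mask(seq_len):
    # True where query position i may attend to key position j (j <= i).
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def masked_attention_weights(scores):
    # scores: (T, T) raw attention scores for one head of one sequence.
    seq_len = scores.shape[0]
    # Future positions get -inf so they receive zero weight after softmax.
    masked = np.where(causal_mask(seq_len), scores, -np.inf)
    masked = masked - masked.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(masked)
    return weights / weights.sum(axis=-1, keepdims=True)

def next_token_loss(logits, token_ids):
    # logits: (T, V); logits[t] is the prediction for token_ids[t + 1].
    pred = logits[:-1]        # the last position has no target
    target = token_ids[1:]    # targets are the inputs shifted left by one
    pred = pred - pred.max(axis=-1, keepdims=True)
    log_probs = pred - np.log(np.exp(pred).sum(axis=-1, keepdims=True))
    # Mean negative log-likelihood of the true next tokens.
    return -log_probs[np.arange(len(target)), target].mean()
```

With all-zero scores, row t of the weights is uniform over positions 0..t and zero afterwards, which is the "only previous tokens" constraint in matrix form.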

Evolution

GPT-1 (2018): pre-training + SFT, but SFT for specific tasks like classification, not for chatting.

GPT-2 (2019): pure pre-training. OpenAI wanted to see what the model learned from scale alone, evaluated zero-shot with no task-specific fine-tuning.

GPT-3 (2020): pure pre-training, massive (175B parameters). Predicted text beautifully but didn't reliably follow instructions.

InstructGPT / GPT-3.5: the inflection point. Took GPT-3 and added SFT + RLHF, birth of the "assistant".

GPT-4 / GPT-4o: pre-training + SFT + RLHF + RLAIF. Same pipeline but scaled, with AI models helping to score and correct outputs.

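For sequence-level outputs (e.g. classification), GPT doesn't use a [CLS] token. It simply takes the hidden state of the last token: under causal masking, that is the only position that has attended to the whole sequence.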
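A short sketch of last-token pooling, assuming right-padded batches and an attention mask with 1 for real tokens and 0 for padding. The name `last_token_pool` and the shapes are hypothetical and only illustrate the idea. With padding, "last token" has to mean the last real token, not the last column.

```python
import numpy as np

def last_token_pool(hidden, attention_mask):
    # hidden: (B, T, D) final-layer hidden states.
    # attention_mask: (B, T), 1 for real tokens, 0 for padding (right-padded).
    lengths = np.asarray(attention_mask).sum(axis=1).astype(int)
    last_idx = lengths - 1  # index of the last real token in each sequence
    # Pick one hidden vector per sequence: the one at its last real position.
    return hidden[np.arange(hidden.shape[0]), last_idx]
```

The pooled vector would then feed a small classification head. A BERT-style encoder can pool at the first position because attention is bidirectional, but in a causal model the first position has seen only itself.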