ML//Training//Pre-training
Imagine a baby alien staring at the static of the universe: it reads billions of sentences until it realizes that "the pilot has turned on the..." is followed by "seatbelt sign" and not "karaoke machine".
The objective is pure next-token prediction at massive scale, with no human feedback and no instructions.
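To make "next-token prediction" concrete, here is a minimal sketch of the loss for it: cross-entropy between the model's scores over the vocabulary and the token that actually came next. The function names and the toy two-word vocabulary are made up for illustration; real training computes the same quantity over huge batches with a tensor library.

```python
import math

def next_token_loss(logits, target_index):
    """Cross-entropy at one position: -log softmax(logits)[target_index]."""
    m = max(logits)  # subtract the max before exponentiating, for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target_index]

def sequence_loss(logits_per_position, token_ids):
    """Average loss where the scores at position t are judged against token t+1."""
    targets = token_ids[1:]  # the first token has nothing before it to predict from
    # zip stops at the shorter list, so a final position with no next token is ignored
    losses = [next_token_loss(l, t) for l, t in zip(logits_per_position, targets)]
    return sum(losses) / len(losses)

# Toy vocabulary (hypothetical): index 0 = "seatbelt", index 1 = "karaoke".
# After "the pilot has turned on the", a trained model should score "seatbelt" higher.
scores = [2.0, -1.0]
loss_if_seatbelt = next_token_loss(scores, 0)  # small: the model expected this
loss_if_karaoke = next_token_loss(scores, 1)   # large: the model is surprised
```

Pre-training is just gradient descent on this number, averaged over billions of positions.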
Pre-training is the ceiling: SFT and DPO/RL don't create new capabilities; they navigate toward useful regions of the distribution already learned here. If the pre-training data lacks high-quality reasoning, no amount of RL recovers it. Rule of thumb: a small model with a lot of RL still loses to a large model with a little RL.
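One way to see the "ceiling" idea is an idealized sketch, assuming post-training behaves like KL-regularized RL. That assumption is not stated in these notes. Under it, the optimal tuned policy is the base distribution reweighted by exp(reward / beta). The output names, rewards, and the `reweight` helper below are invented for illustration, and a real model rarely assigns exactly zero probability to anything, so treat this as a cartoon of the argument.

```python
import math

def reweight(base_probs, rewards, beta=1.0):
    """Return pi(y) proportional to base_probs[y] * exp(rewards[y] / beta)."""
    scores = {y: p * math.exp(rewards.get(y, 0.0) / beta) for y, p in base_probs.items()}
    total = sum(scores.values())
    return {y: s / total for y, s in scores.items()}

# Hypothetical base model: mostly guesses, sometimes reasons carefully,
# and has no probability mass at all on a skill it never saw in pre-training.
base = {"guess": 0.90, "careful_proof": 0.10, "never_seen_skill": 0.0}
rewards = {"guess": 0.0, "careful_proof": 5.0, "never_seen_skill": 100.0}

tuned = reweight(base, rewards)
# "careful_proof" goes from rare to dominant: RL found a region that was already there.
# "never_seen_skill" stays at zero despite the huge reward: zero times anything is zero.
```

In this toy setting RL can amplify what the base model already does occasionally, but it cannot conjure what the base model never does.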
This makes model collapse especially dangerous: the sophisticated reasoning that lives in the tail of the distribution is the first thing erased by recursive training on synthetic data, and once it is gone from pre-training, extended thinking can't reach those regions.
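The tail-erasure effect can be reproduced with a toy simulation, sketched here under strong simplifying assumptions: the "model" is just a table of token frequencies, and each generation is refit on a finite sample drawn from the previous one. The function name and all parameter values are arbitrary choices for illustration.

```python
import random
from collections import Counter

def simulate_collapse(vocab_size=50, samples_per_generation=100, generations=20, seed=0):
    """Track how many distinct tokens survive when each generation trains on the last one's samples."""
    rng = random.Random(seed)
    tokens = list(range(vocab_size))
    # Generation 0 is "real data": a Zipf-like distribution with a long tail of rare tokens.
    weights = [1.0 / (rank + 1) for rank in tokens]
    support_sizes = [vocab_size]
    for _ in range(generations):
        # The current model generates a finite synthetic corpus...
        sample = rng.choices(tokens, weights=weights, k=samples_per_generation)
        counts = Counter(sample)
        # ...and the next model is fit to it. A token that was not sampled gets weight 0
        # (Counter returns 0 for missing keys) and can never be sampled again.
        weights = [counts[t] for t in tokens]
        support_sizes.append(sum(1 for w in weights if w > 0))
    return support_sizes

sizes = simulate_collapse()
# sizes[0] is the full vocabulary; later entries can only shrink, and the rare tokens go first.
```

The loss is one-way: once a rare token drops out of a generation, nothing downstream can bring it back, which is the same reason later-stage training cannot restore reasoning patterns that vanished from the pre-training data.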
Continued pre-training: like sending a toddler who just learned to speak to medical school. You dump a mountain of niche books on its head, using the same next-token objective on domain-specific text.
GPT-2 was pure pre-training with no task-specific fine-tuning: OpenAI wanted to see if volume alone was enough.
GPT-3 was the same recipe at massive scale: it predicted text beautifully but didn't reliably follow instructions.