ML//Training//Pre-training

Imagine a baby alien staring at the static of the universe — it stares at billions of sentences until it realizes "the pilot has turned on the..." is followed by "seatbelt sign" and not "karaoke machine".


Pre-training is pure next-token prediction at massive scale: no human feedback, no instructions.
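As a rough illustration of what "next-token prediction" means, here is a minimal sketch that uses a bigram count model in place of a neural network. The corpus, the function names (`train_bigram`, `next_token_probs`, `avg_next_token_loss`) and the `1e-9` floor for unseen tokens are all assumptions made up for this example. Real pre-training minimizes the same kind of average negative log-likelihood, but with a transformer conditioned on the full context rather than on one previous word.

```python
import math
from collections import Counter, defaultdict

# Toy corpus (hypothetical); real pre-training uses billions of sentences.
corpus = [
    "the pilot has turned on the seatbelt sign",
    "the pilot has turned on the seatbelt sign again",
    "the pilot has turned off the seatbelt sign",
    "the singer has turned on the karaoke machine",
]

def train_bigram(sentences):
    """Count how often each token follows each previous token."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts

def next_token_probs(counts, prev):
    """Normalize the raw counts into a distribution over next tokens."""
    total = sum(counts[prev].values())
    return {tok: c / total for tok, c in counts[prev].items()}

def avg_next_token_loss(counts, sentence):
    """Average negative log-likelihood per predicted token (cross-entropy)."""
    tokens = sentence.split()
    losses = []
    for prev, nxt in zip(tokens, tokens[1:]):
        # Assumed floor so unseen continuations give a large, finite loss.
        p = next_token_probs(counts, prev).get(nxt, 1e-9)
        losses.append(-math.log(p))
    return sum(losses) / len(losses)

counts = train_bigram(corpus)
probs_after_the = next_token_probs(counts, "the")
loss_seatbelt = avg_next_token_loss(counts, "the pilot has turned on the seatbelt sign")
loss_karaoke = avg_next_token_loss(counts, "the pilot has turned on the karaoke machine")
```

In this toy, `probs_after_the` puts more mass on "seatbelt" than on "karaoke", and the seatbelt sentence gets the lower loss. Note that the bigram model only sees one previous word, so it cannot tell the pilot's context from the singer's. Modeling long contexts is exactly what scale buys.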

Pre-training is the ceiling: SFT and DPO/RL don't create new capabilities; they navigate toward useful regions of the distribution already learned here. If the pre-training data lacks high-quality reasoning, no amount of RL recovers it. Rule of thumb: a small model with a lot of RL still loses to a large model with a little RL.

This makes model collapse especially dangerous: the sophisticated reasoning that lives in the tail of the distribution is the first thing erased by recursive training on synthetic data, and once it is gone from pre-training, extended thinking can't reach those regions.
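The tail-erasure effect can be sketched with a toy simulation. Everything here is an assumption for illustration: a 200-token Zipf-like vocabulary stands in for "language", each generation's "model" is just the empirical token frequencies of a finite sample drawn from the previous generation, and the names `resample_generation` and `simulate_collapse` are hypothetical. It is not a claim about how any real training pipeline behaves, only a demonstration that rare tokens which happen to draw zero samples can never come back.

```python
import random
from collections import Counter

def resample_generation(probs, n_samples, rng):
    """Sample a synthetic corpus from `probs`, then refit by counting."""
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    samples = rng.choices(tokens, weights=weights, k=n_samples)
    counts = Counter(samples)
    # Tokens that were never sampled drop out of the next model entirely.
    return {t: counts[t] / n_samples for t in tokens if counts[t] > 0}

def simulate_collapse(probs, generations, n_samples, seed=0):
    """Train each generation only on the previous generation's output."""
    rng = random.Random(seed)
    support_sizes = [len(probs)]  # number of tokens with nonzero probability
    for _ in range(generations):
        probs = resample_generation(probs, n_samples, rng)
        support_sizes.append(len(probs))
    return probs, support_sizes

# Hypothetical Zipf-like distribution: a few common tokens, a long rare tail.
vocab_size = 200
raw = [1.0 / rank for rank in range(1, vocab_size + 1)]
total = sum(raw)
initial = {f"tok{rank}": w / total for rank, w in enumerate(raw, start=1)}

final, support_sizes = simulate_collapse(initial, generations=20, n_samples=500)
```

`support_sizes` can only stay flat or shrink from one generation to the next, and the tokens that disappear first are overwhelmingly the rare ones, since a token with probability well under `1 / n_samples` is likely to draw zero samples.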

Continued pre-training: like sending a toddler who just learned to speak to medical school. Dump a mountain of niche books on its head, i.e. keep running the same next-token objective on a domain-specific corpus.

GPT-2 was pure pre-training (no fine-tuning) — OpenAI wanted to see if volume alone was enough.

GPT-3 was the same recipe but massive: it predicted text beautifully but didn't reliably follow instructions.