ML//Training//Pre-training

2026-02-15

Imagine a baby alien staring at the static of the universe. It stares at billions of sentences until it realizes "the pilot has turned on the..." is followed by "seatbelt sign" and not "karaoke machine".

Pure next-token prediction at massive scale: no human feedback, no instructions.
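To make "next-token prediction" concrete, here is a minimal sketch using a bigram counter in place of a neural network. It is not how an LLM is built: a real model conditions on the whole context and learns by gradient descent. The objective has the same shape, though: assign high probability to the token that actually comes next, and measure the miss with average negative log-likelihood. The function names (`train_bigram`, `next_token_probs`, `avg_next_token_loss`) and the tiny corpus are made up for illustration.

```python
import math
from collections import Counter, defaultdict


def train_bigram(tokens):
    """Count how often each token follows each previous token."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts


def next_token_probs(counts, prev):
    """Turn the follow-up counts for `prev` into a probability distribution."""
    total = sum(counts[prev].values())
    return {tok: c / total for tok, c in counts[prev].items()}


def avg_next_token_loss(counts, tokens, floor=1e-9):
    """Average negative log-likelihood of each actual next token.

    `floor` is an assumed stand-in probability for continuations never seen
    in training, so the log stays finite.
    """
    losses = []
    for prev, nxt in zip(tokens, tokens[1:]):
        p = next_token_probs(counts, prev).get(nxt, floor)
        losses.append(-math.log(p))
    return sum(losses) / len(losses)


# Toy corpus: the same sentence three times, split on whitespace.
corpus = ("the pilot has turned on the seatbelt sign . " * 3).split()
model = train_bigram(corpus)

after_the = next_token_probs(model, "the")
seen_loss = avg_next_token_loss(model, corpus)
unseen_loss = avg_next_token_loss(model, "the karaoke machine".split())
```

In this toy, "the" is followed by "pilot" and "seatbelt" equally often, and "karaoke" never appears, so the unseen sentence scores a much higher loss than the training text. Pre-training is this same loss pushed down over billions of sentences with a far more expressive model.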

Pre-training sets the ceiling. SFT and DPO/RL don't create new capabilities; they steer the model toward useful regions of the distribution it already learned here. If the pre-training data lacks high-quality reasoning, no amount of RL recovers it. A small model with a lot of RL still loses to a large model with a little RL.

This makes model collapse especially dangerous: the sophisticated reasoning that lives in the tail of the distribution is the first thing erased by recursive training on synthetic data, and once it is gone from pre-training, extended thinking can't reach those regions.
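The tail-erasure mechanism can be shown with a toy simulation. This is a sketch under strong assumptions, not a model of real LLM training: the "model" is just a categorical distribution over a made-up vocabulary, and each generation is "trained" by refitting relative frequencies on samples drawn from the previous generation. The names `resample_generation` and `simulate_collapse`, the Zipf-like weights, and the sample sizes are all hypothetical choices for illustration.

```python
import random
from collections import Counter


def resample_generation(dist, n_samples, rng):
    """Sample n_samples tokens from dist, then refit by relative frequency."""
    tokens = list(dist)
    weights = [dist[t] for t in tokens]
    draws = rng.choices(tokens, weights=weights, k=n_samples)
    counts = Counter(draws)
    # Tokens that were never drawn get no entry: their probability is now zero.
    return {t: c / n_samples for t, c in counts.items()}


def simulate_collapse(dist, generations, n_samples, seed=0):
    """Repeatedly train each generation on the previous generation's samples."""
    rng = random.Random(seed)
    support_sizes = [len(dist)]
    for _ in range(generations):
        dist = resample_generation(dist, n_samples, rng)
        support_sizes.append(len(dist))
    return dist, support_sizes


# Zipf-like toy vocabulary: a few common tokens and a long tail of rare ones.
vocab = {f"tok{i}": 1.0 / (i + 1) for i in range(200)}
final_dist, sizes = simulate_collapse(vocab, generations=20, n_samples=500)
```

`sizes` records how many distinct tokens still have nonzero probability after each generation. It can only shrink: a token that is not sampled once has probability zero from then on, so no later generation can bring it back. The rare tokens go first, which is the toy version of the tail being erased.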

Continued pre-training: like sending a toddler who just learned to speak to medical school, then dumping a mountain of niche books on its head.

GPT-2 was pure pre-training (no fine-tuning). OpenAI wanted to see if volume alone was enough.

GPT-3 was the same recipe at a far larger scale. It predicted text beautifully but didn't reliably follow instructions.