ML//model//GPT//o1
OpenAI's first frontier reasoning model (September 2024) — the model that proved extended thinking scales.
Uses RL on chain of thought: the model is trained to generate high-quality reasoning chains. Reward signal = is the final answer correct? Possibly also uses process reward models (PRMs), which score each reasoning step rather than just the outcome; OpenAI has not confirmed this.
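To make the outcome-versus-process distinction concrete, here is a minimal sketch. It is not OpenAI's method: `toy_step_scorer` is a hypothetical stand-in for a learned PRM, and taking the minimum over step scores is just one common aggregation choice, assumed here for illustration.

```python
from typing import Callable, List


def outcome_reward(final_answer: str, reference: str) -> float:
    """Outcome-only reward: 1.0 if the final answer matches, else 0.0."""
    return 1.0 if final_answer.strip() == reference.strip() else 0.0


def process_reward(steps: List[str], step_scorer: Callable[[str], float]) -> float:
    """Process reward: score every step, then aggregate.

    Aggregating with min() is an assumption; it means one bad step
    sinks the whole chain.
    """
    if not steps:
        return 0.0
    return min(step_scorer(step) for step in steps)


def toy_step_scorer(step: str) -> float:
    # Hypothetical stand-in for a learned PRM: it only flags one known-bad step.
    return 0.0 if "2 + 2 = 5" in step else 1.0


# A chain that reaches the right answer through a wrong step.
chain = ["2 + 2 = 5", "5 - 1 = 4", "so the answer is 4"]
print(outcome_reward("4", "4"))
print(process_reward(chain, toy_step_scorer))
```

The outcome reward accepts this chain because the final answer is right; the process reward rejects it because the first step is wrong.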
Core insight: more test-time compute = better answers. Instead of making the model bigger, let it think longer.
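A toy calculation shows why extra inference compute can help. Note this uses majority voting over independent samples, which is a different and simpler form of test-time compute than o1's longer reasoning chains. It assumes binary answers, independent samples, and a fixed per-sample accuracy `p`; none of these figures come from o1.

```python
from math import comb


def majority_vote_accuracy(p: float, n: int) -> float:
    """Probability that a strict majority of n independent samples is correct.

    Assumes binary answers and an odd n so there are no ties.
    """
    if n % 2 == 0:
        raise ValueError("use an odd n to avoid ties")
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(n // 2 + 1, n + 1)
    )


for n in (1, 5, 25):
    print(n, round(majority_vote_accuracy(0.6, n), 3))
```

With p = 0.6, accuracy rises from 0.6 at one sample to about 0.683 at five, with model size unchanged.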
The thinking is hidden from the user: the model generates internal reasoning tokens, and only the final response (plus, in ChatGPT, a summary of the reasoning) is shown.
Dominated math, coding, and science benchmarks — outperformed GPT-4 on GPQA, MATH, and competition-level problems.
Overthinking weakness: reported to do worse than GPT-4 on some simple common-sense questions. A proposed explanation is that extended thinking on a trivial problem shifts the model off-distribution, so it settles into the wrong answer basin.
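OpenAI didn't publish full technical details. But DeepSeek R1 published its pipeline (SFT + GRPO, i.e. supervised fine-tuning plus Group Relative Policy Optimization) and roughly matched o1 on reasoning benchmarks, which suggests the approach isn't magic.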
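The core of GRPO is simple to sketch: sample a group of answers to the same prompt, score each one, and normalize each reward against the group's own mean and standard deviation, so no separate value network is needed. The code below covers only that advantage step, not the full policy update. The function name, the `eps` term, and the use of population standard deviation are assumptions made for illustration.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each reward against its own group's mean and std.

    eps guards against division by zero when every sample in the group
    gets the same reward (all advantages are then 0).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption for this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to one prompt, scored 1.0 if correct and 0.0 otherwise.
rewards = [1.0, 0.0, 0.0, 1.0]
print(group_relative_advantages(rewards))
```

Correct answers get a positive advantage and incorrect ones a negative advantage, relative to the other samples for that prompt.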