ML//reasoning model

Models trained to "think" before answering (extended thinking): they generate intermediate reasoning tokens first, then produce the final answer conditioned on that richer context.


o1 (OpenAI, Sep 2024): first frontier reasoning model. Trained with RL on chain-of-thought (CoT); the recipe is undisclosed, though process reward models are a likely ingredient.

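o3 (OpenAI, announced Dec 2024): scaled the approach further, with configurable compute budgets. Reported 75.7% on the ARC-AGI semi-private eval at the low-compute setting and 87.5% at high compute, well above earlier models.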

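R1 (DeepSeek, Jan 2025): open-weight, roughly matched o1 on reasoning benchmarks using GRPO-based RL. The R1-Zero variant used no SFT at all; R1 itself added a small cold-start SFT stage before RL. Proved the approach isn't proprietary.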
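A minimal sketch of the group-relative advantage at the heart of GRPO: sample several answers to the same prompt, score each, and normalize every reward against its own group's mean and standard deviation, so no learned value model is needed. The function name, the `eps` term, and the choice of population standard deviation are assumptions made for illustration; this is not DeepSeek's code, and it omits the clipped policy-ratio objective and KL penalty.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and std of its own group.

    `rewards` holds the scores of several sampled answers to ONE prompt.
    `eps` (an assumption here) avoids division by zero when all rewards tie.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an illustrative choice
    return [(r - mean) / (std + eps) for r in rewards]

# Four sampled answers to one prompt, scored 1.0 if correct and 0.0 otherwise.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)  # correct answers get positive advantage, wrong ones negative
```

When every answer in a group gets the same reward, all advantages come out as zero and that prompt contributes no learning signal.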

The paradigm dominates 2025. Core insight: more test-time compute tends to yield better answers, so model size can be traded for thinking time.
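A toy illustration of the compute-for-accuracy trade, using majority voting over parallel samples (self-consistency) rather than a longer chain of thought, because it has a closed form. It assumes a binary question, a fixed per-sample accuracy `p`, and fully independent samples; all three are simplifications, and real samples are correlated, so actual gains are smaller.

```python
from math import comb

def majority_vote_accuracy(p, n):
    """Probability that a strict majority of n independent samples is correct.

    Assumes a binary answer, per-sample accuracy p, and odd n (no ties).
    """
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(n // 2 + 1, n + 1)
    )

# More samples per question means more test-time compute.
for n in (1, 5, 25):
    print(n, round(majority_vote_accuracy(0.6, n), 3))
```

Under these assumptions accuracy rises with `n` whenever `p > 0.5`, and falls with `n` when `p < 0.5`: extra compute amplifies a model that is already better than chance, it does not rescue one that is not.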

Mechanistically (one interpretation): distributional shift. The reasoning tokens condition the model into the region of latent space where pretraining associated explicit reasoning with correct conclusions.