ML//reasoning model
Models trained to "think" before answering via extended thinking — generate intermediate reasoning tokens, then produce the final answer from a richer context
o1 (OpenAI, Sep 2024): first frontier reasoning model. Uses RL on CoT, likely with process reward models.
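o3 (OpenAI, announced Dec 2024): scaled further, configurable compute budgets, record ARC-AGI-1 scores (87.5% in the high-compute setting).
R1 (DeepSeek, Jan 2025): open-weight, roughly matched o1 on math and coding benchmarks using GRPO. The R1-Zero variant used pure RL with no SFT; R1 itself adds a small cold-start SFT stage. Proved the approach isn't proprietary.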
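GRPO's central trick is to score each sampled answer against the other answers in its own group, so no separate value network is needed. A minimal sketch of that advantage step only, assuming binary correctness rewards and population standard deviation (both are illustrative choices, and `group_relative_advantages` is a hypothetical helper name, not anyone's published code):

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its group's mean and spread.

    Assumption: population std and a small eps to avoid division by zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to one prompt: 1.0 = correct, 0.0 = wrong (made-up values).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Correct samples get a positive advantage and wrong ones a negative advantage; if every sample in a group earns the same reward, all advantages are zero and that prompt contributes no gradient signal. The clipped policy-gradient objective and KL term are omitted here.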
The paradigm dominates 2025. Core insight: more test-time compute tends to give better answers, so you can trade model size for thinking time.
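A toy way to see the test-time compute effect is majority voting over repeated samples (self-consistency). This is a different mechanism from a single long reasoning trace, and is used here only because it is easy to simulate. The sketch assumes a made-up "model" that answers correctly 40% of the time and spreads its errors over three wrong answers; all names and numbers are hypothetical.

```python
import random
from collections import Counter

CORRECT = 42
WRONG = (41, 43, 44)


def sample_answer(rng, p_correct=0.4):
    """Stand-in for one sampled model answer (hypothetical 40% accuracy)."""
    if rng.random() < p_correct:
        return CORRECT
    return rng.choice(WRONG)


def majority_vote(rng, n_samples):
    """Draw n_samples answers and return the most frequent one."""
    votes = Counter(sample_answer(rng) for _ in range(n_samples))
    return votes.most_common(1)[0][0]


def accuracy(n_samples, trials=2000, seed=0):
    """Fraction of trials in which the voted answer is correct."""
    rng = random.Random(seed)
    hits = sum(majority_vote(rng, n_samples) == CORRECT for _ in range(trials))
    return hits / trials


for n in (1, 5, 25):
    print(n, accuracy(n))
```

In this simulation accuracy climbs as the number of samples grows, because the correct answer is the single most likely outcome even though it is not the majority of draws. The gain depends on that assumption: if one wrong answer were more likely than the right one, voting would amplify the error instead.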
Mechanistically (one hypothesis): distributional shift. Reasoning tokens move the model's context into the region of latent space where pretraining associated explicit reasoning with correct conclusions.