ML//reasoning model

2026-02-25

Models trained to "think" before answering via extended thinking: they generate intermediate reasoning tokens, then produce the final answer from a richer context.

o1 (OpenAI, Sep 2024): first frontier reasoning model. Uses RL on CoT, likely with process reward models.

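o3 (OpenAI, announced Dec 2024): scaled further, configurable compute budgets. Reported 75.7% on the ARC-AGI semi-private eval in its low-compute setting and 87.5% in high-compute, well above prior systems.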

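R1 (DeepSeek, Jan 2025): open-weight, roughly matched o1 on math and coding benchmarks using GRPO. The R1-Zero variant showed reasoning can emerge from RL alone with no SFT; R1 itself adds a small cold-start SFT stage. Proved the approach isn't proprietary.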
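Sketch of the group-relative part of GRPO, to make the "no separate value model" idea concrete: sample several answers to one prompt, score each, and normalize rewards within the group. The function name, the example rewards, and the use of population standard deviation are my assumptions here, not DeepSeek's actual code; the full objective (clipped policy ratio, KL penalty) is omitted.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps).
    eps avoids division by zero when every sample gets the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations vary
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for 4 sampled answers to one prompt (1.0 = correct).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Correct samples end up with positive advantages and incorrect ones with negative, so the baseline comes from the group itself rather than a learned critic.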

The paradigm dominates 2025. Core insight: more test-time compute tends to give better answers, most clearly on math, code, and other multi-step problems. Trade model size for thinking time.
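Toy illustration of the test-time compute claim, using the simplest form I can compute exactly: sample n answers in parallel and take a majority vote. This is not how o1-style models spend compute (they think longer in one chain), and it assumes each sample is independently correct with probability p and that a win requires a strict majority. The function name and p = 0.6 are made up for the example.

```python
from math import comb


def majority_vote_accuracy(p, n):
    """Probability that a strict majority of n independent samples is correct.

    Assumes each sample is correct with probability p, independently.
    Sums the binomial tail from k = n // 2 + 1 up to n.
    """
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(n // 2 + 1, n + 1)
    )


# Hypothetical per-sample accuracy of 0.6, with growing sample budgets.
for n in (1, 5, 25):
    print(n, round(majority_vote_accuracy(0.6, n), 3))
```

With p above 0.5, accuracy rises as n grows, so a fixed model gets better purely by spending more inference compute. With p below 0.5 the same vote makes things worse, which is one reason the base model still matters.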

Mechanistically (working hypothesis, not settled): distributional shift. Reasoning tokens condition the model toward the region of latent space where pretraining associated explicit thought with correct conclusions.