ML//Inference//extended thinking

Generate intermediate reasoning tokens before the final answer — the core technique behind reasoning models (o1, o3, R1, QwQ)


Why it works, mechanically: a transformer carries no state from one decoding step to the next other than the context itself (the tokens so far). Thinking tokens literally change that context before the answer is generated:

Without thinking: [question] -> answer (context = just the question)

With thinking: [question][reasoning...] -> answer (context = question + reasoning). The answer emerges from a richer, different context.
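The two cases above can be sketched with a toy decoding loop. This is a minimal illustration, not a real model: `toy_model`, `STEPS`, the `<think>` / `</think>` / `<eos>` markers and the arithmetic question are all made up for the example. The only point is that the "model" is a pure function of the context, so the answer changes only because the context does.

```python
from typing import Callable, List

# Hypothetical stand-in for a language model: a pure function from the full
# context (all tokens so far) to the next token. It holds no state between calls.
NextToken = Callable[[List[str]], str]

# Hard-coded "reasoning" the toy emits after <think>; purely illustrative.
STEPS = ["10*3=30", "7*3=21", "30+21=51", "</think>"]


def toy_model(tokens: List[str]) -> str:
    """Rigged toy rule, not a real model: it answers correctly only when the
    intermediate steps are already in its context."""
    if "</think>" in tokens:
        # Reasoning is complete: answer once, then stop.
        return "51" if tokens[-1] == "</think>" else "<eos>"
    if "<think>" in tokens:
        # Still thinking: pick the next step from the context alone.
        emitted = len(tokens) - tokens.index("<think>") - 1
        return STEPS[emitted]
    # No thinking in context: blurt out a guess, then stop.
    return "<eos>" if tokens[-1] == "41" else "41"


def decode(model: NextToken, context: List[str],
           max_new_tokens: int = 8, stop: str = "<eos>") -> List[str]:
    """Greedy autoregressive loop: each new token is appended to the context,
    and the next call sees that longer context."""
    tokens = list(context)
    for _ in range(max_new_tokens):
        nxt = model(tokens)
        tokens.append(nxt)
        if nxt == stop:
            break
    return tokens[len(context):]


without_thinking = decode(toy_model, ["17*3=?"])
with_thinking = decode(toy_model, ["17*3=?", "<think>"])
print(without_thinking)  # ['41', '<eos>']
print(with_thinking)     # ['10*3=30', '7*3=21', '30+21=51', '</think>', '51', '<eos>']
```

The same function produces a wrong guess or the right answer depending only on what precedes the answer position.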

This triggers a distributional shift: text of the form reasoning-then-conclusion occupies its own region of the learned distribution, loosely a basin of attraction in latent space. In pretraining data, explicit reasoning tended to be followed by more coherent conclusions, and the thinking repositions the model into that basin.

Not magic — it's geometry of the embedding space plus accumulated context. The thinking doesn't teach the model anything new; it positions it where it already knows how to produce quality output.

Pretraining is the ceiling: SFT and RL post-training don't create new capabilities; they steer the model toward useful regions of the distribution learned during pretraining. If the pretraining data contained no high-quality reasoning, no amount of RL recovers it. As a rule of thumb: small model + heavy RL < large model + light RL.

Scales with compute: more thinking tokens = richer context = more precise positioning. o1 appears to think in a single linear chain; o3 likely adds tree search (explore N branches, pick the best), though neither design is publicly documented.
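The simplest form of "explore N branches, pick the best" is best-of-N sampling, sketched below. This is an assumption chosen for illustration, not a description of how o3 works. `toy_chain` and `toy_score` are hypothetical stand-ins for a sampled reasoning chain and a verifier or reward model.

```python
import random
from typing import Callable, Tuple


def best_of_n(sample_chain: Callable[[random.Random], str],
              score: Callable[[str], float],
              n: int, seed: int = 0) -> Tuple[str, float]:
    """Sample n independent chains and keep the one the scorer likes best."""
    rng = random.Random(seed)
    chains = [sample_chain(rng) for _ in range(n)]
    best = max(chains, key=score)
    return best, score(best)


def toy_chain(rng: random.Random) -> str:
    """Hypothetical sampler: a 'chain' that ends in a noisy guess for 17*3."""
    return f"17*3={rng.randint(40, 60)}"


def toy_score(chain: str) -> float:
    """Hypothetical verifier: higher is better, 0.0 means exactly right."""
    guess = int(chain.split("=")[1])
    return -abs(guess - 51)


_, score_1 = best_of_n(toy_chain, toy_score, n=1)
_, score_16 = best_of_n(toy_chain, toy_score, n=16)
# With the same seed, the 16 samples include the single sample from n=1,
# so the best of 16 can never score lower than the best of 1.
print(score_1, score_16)
```

More branches cost more compute and can only help as far as the scorer is reliable; a real tree search would also expand and prune partial chains rather than scoring only complete ones.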

Chain of thought was the prompting version. Extended thinking is the trained, RL-optimized version with dedicated compute budgets.

Failure modes

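Overthinking on simple problems (the extra tokens shift the model into the wrong basin); compounding exposure bias (each generated token becomes context the model may never have seen in training, so small deviations accumulate); and path dependency in the residual stream (an early wrong step stays in the context and conditions every token after it).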
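The compounding in the exposure-bias failure mode can be made concrete with a back-of-the-envelope calculation. The model below is a deliberate simplification and an assumption of this sketch: each generated token independently has a small probability `eps` of pushing the chain off-distribution, and the chain never recovers once that happens. Real errors are neither independent nor unrecoverable.

```python
def p_on_track(eps: float, num_tokens: int) -> float:
    """Probability that a chain of num_tokens is still on-distribution,
    assuming an independent per-token derailment probability eps."""
    if not 0.0 <= eps <= 1.0:
        raise ValueError("eps must be a probability")
    return (1.0 - eps) ** num_tokens


# Hypothetical per-token rate of 0.1%, chosen only for illustration.
for length in (100, 1_000, 10_000):
    print(length, round(p_on_track(0.001, length), 4))
```

Under these assumptions a per-token rate that looks negligible still leaves only about a 37% chance of a clean 1,000-token chain, which is one way to see why longer thinking is not automatically better.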

In RLHF/RLAIF: raters (human or AI) score reasoned responses higher, so the model learns that showing its reasoning produces structure that leads to better-scored outputs.

Each thinking token is a decoding step that moves the residual stream to a new point — by the time the answer starts, the model is in a completely different neighborhood.