Math//compositionality
The principle that complex functions can be built by composing simpler ones: \(f = f_n \circ f_{n-1} \circ \cdots \circ f_1\)
The principle that complex functions can be built by composing simpler ones: f=fn∘fn−1∘⋯∘f1f = f_n \circ f_{n-1} \circ \cdots \circ f_1f=fn∘fn−1∘⋯∘f1
In transformers: each layer is a function, and the full model is their composition. The residual stream carries the intermediate result between compositions.
In extended thinking: each decoding step applies the full transformer stack to a new input (the context including all previous thinking tokens). The chain of thought is a composition of compositions — nnn steps, each applying LLL layers, for n×Ln \times Ln×L total transformations.
Analogy to Gaussian elimination: solving nnn equations requires nnn elimination steps. Each step simplifies the system. No shortcut exists — the intermediate steps are necessary because each one depends on the previous. Similarly, complex reasoning requires intermediate thinking tokens because each one changes the context for the next.
PSPACE connection: the class of problems solvable with polynomial memory but potentially exponential time. Chain-of-thought reasoning gives the model polynomial "scratch space" (the context window) to solve problems that would require exponential search without it. The thinking tokens are the polynomial workspace.
Why compositionality matters for training: SFT can teach the model what compositions look like (imitating reasoning chains), but only RL can teach it which compositions are worth computing — which intermediate steps actually help.