ML//Transformer//latent space

High-dimensional activation paths where matrix multiplications route inputs into specific semantic clusters.


The residual stream IS the trajectory through this space — each layer moves the vector to a new point.

A given context vector is a point — new token, new step along the manifold.

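The path alternates: context (attention) → fact (MLP) → context → fact. Attention mixes in information from other positions; the MLP transforms each position on its own. Each sublayer adds its output to the vector, so by the last layer the vector holds organized, routable information.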
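The alternation above can be sketched as a toy residual stream. This is a minimal illustration under stated assumptions, not a real transformer: weights are random, there is a single attention head, layer norm is omitted, and every name (`attention_update`, `mlp_update`, `trajectory`) is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_layers, seq_len = 16, 4, 5  # toy sizes, chosen arbitrarily


def attention_update(x, w_q, w_k, w_v):
    """Single-head causal self-attention; returns the update added to the stream."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(x.shape[1])
    # Causal mask: position i may only attend to positions <= i.
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ v


def mlp_update(x, w_in, w_out):
    """Position-wise two-layer MLP with ReLU; returns the update added to the stream."""
    return np.maximum(x @ w_in, 0.0) @ w_out


x = rng.normal(size=(seq_len, d_model))  # one vector per token position
trajectory = [x[-1].copy()]              # track the last token's vector

for _ in range(n_layers):
    w_q, w_k, w_v = (
        rng.normal(scale=d_model ** -0.5, size=(d_model, d_model)) for _ in range(3)
    )
    w_in = rng.normal(scale=d_model ** -0.5, size=(d_model, 4 * d_model))
    w_out = rng.normal(scale=(4 * d_model) ** -0.5, size=(4 * d_model, d_model))

    x = x + attention_update(x, w_q, w_k, w_v)  # context step
    trajectory.append(x[-1].copy())
    x = x + mlp_update(x, w_in, w_out)          # fact step
    trajectory.append(x[-1].copy())

trajectory = np.stack(trajectory)  # (2 * n_layers + 1, d_model)
step_sizes = np.linalg.norm(np.diff(trajectory, axis=0), axis=1)
print(step_sizes)  # how far each sublayer moved the last token's vector
```

Each row of `trajectory` is one point on the path; because updates are added rather than overwritten, the final row is the starting vector plus the sum of all sublayer updates.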

Different contexts land in different regions, which can be read as a form of distributional shift in activation space. Reasoning tokens before an answer position the model in the "coherent conclusion" region. Direct answers land in the "cold response" region. Same model, different neighborhoods.

The last token's final hidden state has been refined by every layer and, through attention, has absorbed the preceding context. It "knows" where it sits relative to everything else, which is sufficient to predict the next token.

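Because logits derive from dot-product similarity between the final hidden state and each token's output vector, and generation is auto-regressive (every sampled token is fed back in as context), LLM output hardly ever diverges from the path, especially at low temperature.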
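A small sketch of the logit step, assuming a hypothetical unembedding matrix `unembed` with one row per vocabulary token and a hand-built `hidden` vector; neither comes from a real model. It shows only the dot product and the temperature scaling, not the full auto-regressive loop.

```python
import numpy as np


def next_token_probs(hidden, unembed, temperature=1.0):
    """Softmax over dot-product logits, scaled by temperature."""
    logits = unembed @ hidden      # one dot product per vocabulary token
    scaled = logits / temperature
    scaled -= scaled.max()         # subtract the max for numerical stability
    probs = np.exp(scaled)
    return probs / probs.sum()


vocab_size, d_model = 8, 16
unembed = np.eye(vocab_size, d_model)  # toy assumption: orthogonal token directions

# Final hidden state that points mostly at token 3 and slightly at token 5.
hidden = 2.0 * unembed[3] + 0.5 * unembed[5]

cold = next_token_probs(hidden, unembed, temperature=0.2)
warm = next_token_probs(hidden, unembed, temperature=1.0)

print(cold.argmax(), warm.argmax())  # both pick the best-aligned token
print(cold[3], warm[3])              # lower temperature concentrates more mass on it
```

Lowering the temperature sharpens the distribution around the token whose output vector is most aligned with the hidden state, so sampling follows the highest-similarity continuation more consistently.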

Mixture of Experts: dynamic activation paths that route inputs into specialized clusters through gating mechanisms — same space, selective computation.
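The gating idea can be sketched as top-k routing. This is an illustrative toy under stated assumptions, not any particular library's MoE layer: the gate is a single linear map, each expert is a single `tanh` layer, and `moe_layer`, `gate_w`, and `expert_ws` are hypothetical names.

```python
import numpy as np


def moe_layer(x, gate_w, expert_ws, top_k=2):
    """Route one vector through its top_k experts and mix their outputs."""
    gate_logits = x @ gate_w                   # one score per expert
    chosen = np.argsort(gate_logits)[-top_k:]  # indices of the top_k experts
    # Softmax over the selected experts only.
    weights = np.exp(gate_logits[chosen] - gate_logits[chosen].max())
    weights /= weights.sum()

    out = np.zeros_like(x)
    for weight, idx in zip(weights, chosen):
        out += weight * np.tanh(x @ expert_ws[idx])  # only chosen experts run
    return out, chosen


rng = np.random.default_rng(0)
d_model, n_experts = 16, 4
gate_w = rng.normal(size=(d_model, n_experts))
expert_ws = rng.normal(scale=d_model ** -0.5, size=(n_experts, d_model, d_model))

x = rng.normal(size=d_model)
y, used = moe_layer(x, gate_w, expert_ws, top_k=2)
print(used)  # which experts this input was routed to
```

The output lives in the same `d_model`-dimensional space as the input; only the subset of weights that touched it differs from one input to the next.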