ML//Transformer//residual stream

2026-03-05

The "highway" that carries information through the entire transformer: the vector that gets iteratively modified by attention and MLP blocks via residual connections

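Each layer reads from the stream, computes a delta, and adds it back. A transformer block does this twice: stream = stream + attention(stream), then stream = stream + MLP(stream). In pre-LN models such as GPT-2, each sublayer reads a layer-normalized copy of the stream, but the write is still a plain addition.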
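The read/compute/write loop above can be sketched in NumPy. This is a toy illustration, not any real model's code: the sizes, the random weights, the single-head attention, and the helper names (layer_norm, toy_attention, toy_mlp) are all assumptions made up for the example. It uses the pre-LN arrangement and omits learned layer-norm gains and biases.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, seq_len, n_layers = 16, 4, 3  # toy sizes, chosen arbitrarily


def layer_norm(x, eps=1e-5):
    # Normalize each position's vector (no learned scale/shift in this sketch).
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)


def toy_attention(x, w_qkv, w_out):
    # Single-head causal self-attention over the sequence.
    q, k, v = np.split(x @ w_qkv, 3, axis=-1)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    future = np.triu(np.ones((len(x), len(x)), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return (weights @ v) @ w_out


def toy_mlp(x, w_in, w_out):
    # Two-layer MLP with ReLU (real models often use GELU).
    return np.maximum(x @ w_in, 0.0) @ w_out


# The stream starts as the token embeddings: one d_model vector per position.
stream = rng.normal(size=(seq_len, d_model))

for _ in range(n_layers):
    w_qkv = rng.normal(size=(d_model, 3 * d_model)) * 0.1
    w_o = rng.normal(size=(d_model, d_model)) * 0.1
    w_in = rng.normal(size=(d_model, 4 * d_model)) * 0.1
    w_out = rng.normal(size=(4 * d_model, d_model)) * 0.1

    # Read a normalized copy, compute a delta, write it back by addition.
    stream = stream + toy_attention(layer_norm(stream), w_qkv, w_o)
    stream = stream + toy_mlp(layer_norm(stream), w_in, w_out)
```

The shape of stream never changes inside the loop, which is the "fixed width" property: every sublayer reads and writes vectors of size d_model.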

Dimension (d_model): 768 in GPT-2 small, up to 12288 in the largest GPT-3 (175B). The width is fixed from the embedding output to the LM head input.

The final hidden state (aka last hidden state) = the residual stream vector after all N layers, before the LM head. After a final layer norm, this vector is multiplied by the unembedding matrix to produce logits. With weight tying (as in GPT-2), the unembedding matrix is the transpose of the embedding matrix.
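A minimal sketch of that last step, assuming weight tying so the embedding matrix doubles as the unembedding. The sizes, the random matrix W_E, and the random final_hidden vector are placeholders invented for the example, and the layer norm has no learned parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model = 50, 16  # toy sizes, chosen arbitrarily

# Embedding matrix: one row per vocabulary token (hypothetical random values).
W_E = rng.normal(size=(vocab_size, d_model))

# Stand-in for the residual stream at the last position after all N layers.
final_hidden = rng.normal(size=(d_model,))


def layer_norm(x, eps=1e-5):
    # Final layer norm, without learned scale/shift in this sketch.
    return (x - x.mean()) / np.sqrt(x.var() + eps)


# With weight tying, the LM head is the transposed embedding matrix.
logits = layer_norm(final_hidden) @ W_E.T

# Softmax turns logits into a next-token distribution.
probs = np.exp(logits - logits.max())
probs /= probs.sum()

next_token = int(np.argmax(probs))
```

Each logit is a dot product between the normalized final hidden state and one token's embedding row, so under weight tying the model scores tokens by how well their embeddings align with the stream.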

Path dependency: once the model starts generating in a direction, each generated token becomes context that shifts the stream further along that trajectory. A wrong token pushes later states into territory the model rarely saw in training, and the trajectory is hard to steer back. This is one way to picture exposure bias mechanistically.

The residual stream IS the latent space trajectory. Each layer moves the vector to a new point in this high-dimensional space.

Why it matters for mechanistic interpretability: every layer's contribution is an additive delta, so the final stream is the embedding plus the sum of all attention-head and MLP outputs, and you can inspect what each component added. Induction heads are a well-known example: their pattern-copying contributions can be identified in the stream.
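The additive structure can be checked directly in a toy setting. The sketch below records every delta, confirms that they sum to the final stream, and splits one token's logit into per-component pieces. Everything here is an assumption for illustration: fake_block is a stand-in for a real attention or MLP sublayer, W_U is a random unembedding matrix, token_id is arbitrary, and the final layer norm is skipped so that the logit is exactly linear in the stream.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_layers, vocab_size = 16, 3, 50  # toy sizes, chosen arbitrarily


def fake_block(x, w):
    # Hypothetical stand-in for an attention or MLP sublayer.
    return np.tanh(x @ w)


# Start from the embedding and log every write to the stream.
stream = rng.normal(size=(d_model,))
contributions = {"embed": stream.copy()}

for layer in range(n_layers):
    for name in ("attn", "mlp"):
        w = rng.normal(size=(d_model, d_model)) * 0.3
        delta = fake_block(stream, w)
        contributions[f"L{layer}.{name}"] = delta
        stream = stream + delta

# The final stream is exactly the sum of everything that was written to it.
reconstructed = sum(contributions.values())
assert np.allclose(reconstructed, stream)

# Because the unembedding is linear, a token's logit splits the same way.
W_U = rng.normal(size=(d_model, vocab_size))
token_id = 7  # arbitrary token chosen for the example
per_component = {k: float(v @ W_U[:, token_id]) for k, v in contributions.items()}
total_logit = float(stream @ W_U[:, token_id])

for name, value in per_component.items():
    print(f"{name:>8}: {value:+.3f}")
```

In a real model the final layer norm makes this split only approximate unless the normalization scale is held fixed, so treat the per-component numbers as a first-order attribution rather than an exact accounting.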