ML//Transformer//residual stream

The "highway" that carries information through the entire transformer: the vector that gets iteratively modified by attention and MLP blocks via residual connections


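Each layer reads from the stream, computes a delta (attention or MLP), and adds it back: stream = stream + attention(stream), then stream = stream + MLP(stream). In pre-LayerNorm models such as GPT-2, each sublayer reads a normalized copy of the stream, but the stream itself is only ever changed by addition.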
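A minimal NumPy sketch of that read/compute/add loop, assuming a pre-LayerNorm layout. The names `make_block`, `layer_norm`, and `forward` are made up for illustration, and the "attention" and "MLP" sublayers are random position-wise stand-ins (no real attention, no token mixing); only the additive update pattern is the point.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 8       # toy width; real models use 768 and up
n_layers = 3

def layer_norm(x, eps=1e-5):
    # Normalize the last axis to zero mean and unit variance (no learned scale or shift).
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def make_block():
    # Hypothetical stand-in for a sublayer: a small random linear map followed by tanh.
    w = rng.normal(scale=0.1, size=(d_model, d_model))
    return lambda x: np.tanh(x @ w)

attn_blocks = [make_block() for _ in range(n_layers)]
mlp_blocks = [make_block() for _ in range(n_layers)]

def forward(stream):
    for attn, mlp in zip(attn_blocks, mlp_blocks):
        stream = stream + attn(layer_norm(stream))  # read, compute delta, add back
        stream = stream + mlp(layer_norm(stream))
    return stream

x0 = rng.normal(size=(4, d_model))  # 4 token positions, one stream vector each
out = forward(x0)
print(out.shape)
```

The width never changes: every sublayer maps a d_model vector to a d_model delta, which is what lets the deltas be added into the same stream.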

Dimension: typically 768 (GPT-2 small) to 12288 (GPT-3 175B). The width is fixed from the output of the embedding layer to the input of the LM head.

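The final hidden state (aka last hidden state) = the residual stream vector after all N layers, before the LM head. This is the vector that gets projected to logits, usually after a final layer norm. In models with tied weights (e.g. GPT-2) the projection is the transposed embedding matrix; other models learn a separate unembedding matrix.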
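A small sketch of that last step, assuming tied weights and a final layer norm without learned parameters. `W_E` and `final_hidden` are random placeholders here, not values from any real model.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model = 10, 4

W_E = rng.normal(size=(vocab_size, d_model))  # embedding matrix, one row per token
final_hidden = rng.normal(size=(d_model,))    # stand-in for the stream after the last layer

def layer_norm(x, eps=1e-5):
    # Final normalization, without the learned scale and shift a real model would have.
    return (x - x.mean()) / np.sqrt(x.var() + eps)

# Tied unembedding: reuse the embedding matrix, transposed, as the LM head.
logits = layer_norm(final_hidden) @ W_E.T

# Softmax over the vocabulary (shifted by the max for numerical stability).
probs = np.exp(logits - logits.max())
probs /= probs.sum()
next_token = int(np.argmax(logits))
```

With untied weights the only change is that `W_E.T` is replaced by a separately learned matrix of shape (d_model, vocab_size).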

Path dependency: once the model starts generating in a direction, each generated token becomes context that shifts the stream further along that trajectory. Wrong tokens push the stream into regions the model rarely saw during training, which can act like an attractor that is hard to escape. This is one mechanistic picture of how exposure bias manifests.

The residual stream IS the latent space trajectory — each layer moves the vector to a new point in this high-dimensional space.

Why it matters for mechanistic interpretability: every layer's contribution is an additive delta, so the final stream is the sum of the embedding and everything each attention head and MLP layer wrote. You can read off what each component added. Induction heads are the standard example: their pattern-copying contributions are identifiable in the stream.
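A toy sketch of reading off per-component contributions, in the spirit of direct logit attribution. Assumptions: the sublayers are random stand-ins built by a hypothetical `make_block`, `W_U` is a made-up unembedding matrix, and layer norm is omitted so that a logit is exactly linear in the stream. Real models need extra care around the final layer norm.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab_size, n_layers = 8, 10, 3

W_U = rng.normal(size=(d_model, vocab_size))  # hypothetical unembedding matrix

def make_block():
    # Stand-in sublayer: random linear map plus tanh.
    w = rng.normal(scale=0.3, size=(d_model, d_model))
    return lambda x: np.tanh(x @ w)

blocks = {}
for i in range(n_layers):
    blocks[f"attn_{i}"] = make_block()
    blocks[f"mlp_{i}"] = make_block()

stream = rng.normal(size=(d_model,))
contributions = {"embed": stream.copy()}  # the embedding is the first "write"
for name, block in blocks.items():
    delta = block(stream)        # what this component writes
    contributions[name] = delta
    stream = stream + delta

# The final stream is exactly the embedding plus every recorded delta.
assert np.allclose(stream, sum(contributions.values()))

# Because the logit is linear in the stream, it splits into one term per component.
token = 3
logit = float(stream @ W_U[:, token])
per_component = {name: float(c @ W_U[:, token]) for name, c in contributions.items()}
for name, value in per_component.items():
    print(f"{name:>7}: {value:+.3f}")
```

The per-component values sum to the full logit for the chosen token, which is what makes "which component pushed toward this token" a well-defined question in this simplified setting.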