ML//Transformer//residual connection
x + f(x) — add the input directly to the output of each sublayer.
Creates the residual stream: a persistent vector that flows through all layers, accumulating additive deltas from attention and MLP blocks.
Keeps gradients from vanishing in stacks of 100+ layers: the identity term gives them a direct path back through every block.
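Why the gradient survives can be seen with a short derivation. This is a sketch, not from the original note; the symbols x_l (stream after layer l), f_l (that layer's sublayer), and L (depth) are notation assumed here, and normalization is treated as folded into f_l.

```latex
% One residual layer and its Jacobian
x_{l} = x_{l-1} + f_{l}(x_{l-1})
\qquad\Longrightarrow\qquad
\frac{\partial x_{l}}{\partial x_{l-1}} = I + \frac{\partial f_{l}}{\partial x_{l-1}}

% Stacking L layers: the product always contains the identity term
\frac{\partial x_{L}}{\partial x_{0}}
  = \prod_{l=1}^{L} \left( I + \frac{\partial f_{l}}{\partial x_{l-1}} \right)
  = I + \sum_{l=1}^{L} \frac{\partial f_{l}}{\partial x_{l-1}} + \text{(higher-order terms)}
```

Without the skip path the Jacobian would be a bare product of the sublayer Jacobians, which can shrink toward zero as L grows; with it, the leading I passes the gradient through unchanged.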
Borrowed from ResNet. Every transformer block has two: one around attention, one around the FFN (the MLP).
Together with layer norm, they form the structural backbone of each [Attention → MLP] block.
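A minimal NumPy sketch of one block, to make the two residual additions concrete. Everything here is illustrative: `layer_norm`, `toy_attention`, `toy_mlp`, and `transformer_block` are hypothetical helpers, the attention has no learned projections, and the pre-norm placement (normalize before each sublayer) is an assumption, since the note does not say where layer norm sits.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token vector to zero mean and unit variance (no learned scale/shift).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def toy_attention(x):
    # Single-head self-attention without learned projections (illustrative stand-in).
    scores = x @ x.T / np.sqrt(x.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerically stable softmax
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ x

def toy_mlp(x, w1, w2):
    # Two-layer feed-forward network with ReLU.
    return np.maximum(x @ w1, 0.0) @ w2

def transformer_block(x, w1, w2):
    # Pre-norm layout (an assumption): each sublayer reads a normalized copy
    # of the stream and writes an additive delta back into it.
    x = x + toy_attention(layer_norm(x))   # residual 1: attention
    x = x + toy_mlp(layer_norm(x), w1, w2) # residual 2: MLP
    return x

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))        # 4 tokens, model width 8
w1 = rng.normal(size=(8, 16))
w2 = rng.normal(size=(16, 8))
out = transformer_block(x, w1, w2)
print(out.shape)
```

The stream keeps its shape through the block, and `out - x` is exactly the sum of the attention delta and the MLP delta, which is the "accumulating additive deltas" picture.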