ML//Transformer//residual connection

2018-09-10

y = x + f(x): add each sublayer's input directly to its output.

Creates the residual stream: a persistent vector that flows through all layers, accumulating additive deltas from attention and MLP blocks.
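A minimal sketch of that accumulation, assuming toy sublayers (small random linear maps with a tanh, standing in for attention and MLP). The names `run_residual_stream`, `d_model`, and `weights` are made up for illustration.

```python
import numpy as np

def run_residual_stream(x, sublayers):
    """Pass x through the sublayers, adding each output back onto the stream."""
    stream = x.copy()
    deltas = []
    for f in sublayers:
        delta = f(stream)        # the sublayer reads the current stream
        stream = stream + delta  # residual connection: x + f(x)
        deltas.append(delta)
    return stream, deltas

rng = np.random.default_rng(0)
d_model = 8

# Hypothetical stand-ins for attention/MLP sublayers.
weights = [0.1 * rng.standard_normal((d_model, d_model)) for _ in range(4)]
sublayers = [lambda h, W=W: np.tanh(h @ W) for W in weights]

x = rng.standard_normal(d_model)
out, deltas = run_residual_stream(x, sublayers)

# The final stream is the original input plus the sum of every delta.
assert np.allclose(out, x + sum(deltas))
```

No sublayer overwrites the stream; each one only adds to it, so the original input is still present in the final vector.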

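Keeps gradients flowing through 100+ layers without vanishing: the identity path gives the gradient a direct route back to earlier layers that does not pass through any sublayer.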
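A short derivation of why the skip path helps, assuming a stack of L residual sublayers. The symbols x_l, f_l, and J_l are notation introduced here, not taken from the note.

```latex
x_l = x_{l-1} + f_l(x_{l-1}), \qquad J_l = \frac{\partial f_l}{\partial x_{l-1}}

\frac{\partial x_L}{\partial x_0}
  = \prod_{l=L}^{1} (I + J_l)
  = I + \sum_{l=1}^{L} J_l + \text{(products of two or more } J_l\text{)}
```

Without the skip connections the same quantity would be the bare product of the J_l, which can shrink geometrically when the Jacobians have small norm. The identity term I is always present, regardless of depth.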

Borrowed from ResNet. Every transformer block has two: one around the attention sublayer, one around the FFN (MLP) sublayer.

Together with layer norm, they form the structural backbone of each [Attention → MLP] block.
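A rough sketch of one such block, under several assumptions the note does not make: layer norm is applied before each sublayer (pre-LN placement; the original Transformer normalizes after the add instead), attention is single-head and unmasked, and layer norm has no learned gain or bias. All function and parameter names are illustrative.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token vector to zero mean and unit variance (no learned scale/shift).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract the row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention(x, Wq, Wk, Wv, Wo):
    # Single-head, unmasked self-attention over a (seq_len, d_model) input.
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v @ Wo

def mlp(x, W1, W2):
    # Two-layer feed-forward network with ReLU.
    return np.maximum(x @ W1, 0.0) @ W2

def transformer_block(x, p):
    x = x + attention(layer_norm(x), p["Wq"], p["Wk"], p["Wv"], p["Wo"])  # residual 1
    x = x + mlp(layer_norm(x), p["W1"], p["W2"])                          # residual 2
    return x

rng = np.random.default_rng(0)
seq_len, d_model, d_ff = 4, 8, 32
shapes = {
    "Wq": (d_model, d_model), "Wk": (d_model, d_model),
    "Wv": (d_model, d_model), "Wo": (d_model, d_model),
    "W1": (d_model, d_ff), "W2": (d_ff, d_model),
}
params = {name: 0.1 * rng.standard_normal(shape) for name, shape in shapes.items()}

x_in = rng.standard_normal((seq_len, d_model))
x_out = transformer_block(x_in, params)  # same shape as x_in
```

If every weight is zero, both sublayers output zero and the block reduces to the identity, which is one way to see that the residual path carries the input through untouched.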