ML//Transformer//feed-forward network

2018-09-10

The "other half" of each transformer layer: linear → activation → linear. W₂ · ReLU(W₁ · x + b₁) + b₂ (or GELU in modern models)
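A minimal NumPy sketch of that formula, for one token vector. The sizes (d_model = 8, d_ff = 32) and the random weights are assumptions for illustration only, not taken from any real model.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def ffn(x, W1, b1, W2, b2):
    """Position-wise feed-forward: W2 · ReLU(W1 · x + b1) + b2."""
    h = relu(W1 @ x + b1)   # expand to the hidden width d_ff
    return W2 @ h + b2      # project back down to d_model

d_model, d_ff = 8, 32       # illustrative sizes (assumption)
rng = np.random.default_rng(0)
W1 = rng.normal(size=(d_ff, d_model))
b1 = np.zeros(d_ff)
W2 = rng.normal(size=(d_model, d_ff))
b2 = np.zeros(d_model)

x = rng.normal(size=d_model)
y = ffn(x, W1, b1, W2, b2)
print(y.shape)  # (8,)
```

The output has the same shape as the input, so it can be added straight back onto the token's embedding.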

It's basically a perceptron that chunks features: if the current embedding aligns with a feature direction above threshold (ReLU), that feature gets added.

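Example: if a vector already points toward "Michael" and "Jordan", the MLP detects that combination and adds the "basketball" direction. Both linear layers have a bias; the first layer's bias sets the threshold a feature must clear before ReLU lets it through.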
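A toy version of the Michael/Jordan example, with one hidden neuron. Everything here is hypothetical: the three feature directions are made orthogonal on purpose, and the weights and the -1.5 bias are hand-picked so the neuron only fires when both names are present. Real models learn messier, overlapping directions.

```python
import numpy as np

# Hypothetical, perfectly orthogonal feature directions in a 4-d embedding space.
michael, jordan, basketball = np.eye(4)[:3]

W1 = (michael + jordan)[None, :]  # one hidden neuron that reads both name directions
b1 = np.array([-1.5])             # threshold: one name alone (score 1.0) is not enough
W2 = basketball[:, None]          # when the neuron fires, write the "basketball" direction
b2 = np.zeros(4)

def mlp(x):
    return W2 @ np.maximum(W1 @ x + b1, 0.0) + b2

both = mlp(michael + jordan)  # pre-activation 2.0 - 1.5 = 0.5, so the neuron fires
only_one = mlp(michael)       # pre-activation 1.0 - 1.5 = -0.5, so ReLU zeroes it
print(both, only_one)
```

With both names present the output is 0.5 along "basketball"; with only one it is the zero vector.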

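Most of each layer's parameters live here (~2/3), not in attention (~1/3). This assumes the usual hidden width of 4× the embedding dimension and leaves out the embedding tables. Rough division of labor: the MLP stores facts, attention routes context.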
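The 2/3 vs 1/3 split can be checked with quick arithmetic. This sketch assumes a standard layer: four d_model × d_model attention projections (Q, K, V, output) with biases, an MLP hidden width of 4 × d_model, and it ignores layer norms and embeddings. The function name and d_model = 768 are just illustrative choices.

```python
def layer_param_counts(d_model, ff_mult=4):
    """Parameter counts for one transformer layer under the assumptions above."""
    d_ff = ff_mult * d_model
    attn = 4 * (d_model * d_model + d_model)                    # Q, K, V, output projections
    mlp = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)  # W1, b1, W2, b2
    return attn, mlp

attn, mlp = layer_param_counts(768)
mlp_share = mlp / (attn + mlp)
print(attn, mlp, round(mlp_share, 3))  # 2362368 4722432 0.667
```

Ignoring biases the ratio is 8·d² to 4·d², which is exactly 2:1 regardless of d_model.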

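Recent theory: the FFN acts as a key-value memory. Each row of W₁ is a key that gets matched against the input, and the corresponding column of W₂ is the value added to the output, weighted by how strongly the key matched.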
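The key-value reading is just the same matrix product written as a sum over hidden neurons. A small sketch, assuming ReLU and random weights at toy sizes, that checks the two forms agree:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff = 4, 6  # toy sizes (assumption)
W1 = rng.normal(size=(d_ff, d_model))
b1 = rng.normal(size=d_ff)
W2 = rng.normal(size=(d_model, d_ff))
b2 = rng.normal(size=d_model)
x = rng.normal(size=d_model)

# Usual dense form.
dense = W2 @ np.maximum(W1 @ x + b1, 0.0) + b2

# Memory form: each hidden neuron i is one (key, value) slot.
memory = b2.copy()
for i in range(d_ff):
    key, value = W1[i], W2[:, i]          # row of W1, column of W2
    weight = max(key @ x + b1[i], 0.0)    # how strongly the key matches x
    memory += weight * value              # add the value, scaled by the match

print(np.allclose(dense, memory))  # True
```

Unlike attention, the match weights here are not normalized with a softmax, and the keys and values are fixed parameters rather than computed from the context.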

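No matter how big the model, the embedding dimension (the width of the vector each token is carried in, which is separate from the tokenizer) bottlenecks comprehension. The MLP's hidden layer is usually wider (about 4×), but everything it reads and writes has to fit through that dimension.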

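Terminology: "MLP" (multilayer perceptron) and "FFN" name the same sublayer. It sits next to the attention sublayer in each layer; attention itself contains no MLP.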