ML//Transformer//attention//multi-head attention

Run self-attention multiple times in parallel with different learned projections — each head operates on a reduced subspace.


Run self-attention multiple times in parallel with different learned projections — each head operates on a reduced subspace.

Why reduce dimension? Each head can specialize: syntax, coreference, positional patterns, semantic similarity, induction (pattern-copying) — different types of relationships in different subspaces.

No explicit mechanism forces this — specialization emerges from training. Two heads learning the same thing don't reduce loss, so backprop pushes them to diversify.

GPT-3: 96 heads per layer, each 128-dim. Concatenate all outputs, project back to model dimension via W_O.

GQA shares K/V across groups of heads to reduce KV cache — the modern optimization.

Heads can form cross-layer circuits: induction heads are two heads in consecutive layers that coordinate — one finds matching prefixes, the other copies what followed.