ML//Transformer//attention//multi-head attention
Run self-attention multiple times in parallel with different learned projections — each head operates on a reduced subspace.
Run self-attention multiple times in parallel with different learned projections — each head operates on a reduced subspace.
Why reduce dimension? Each head can specialize: syntax, coreference, positional patterns, semantic similarity, induction (pattern-copying) — different types of relationships in different subspaces.
No explicit mechanism forces this — specialization emerges from training. Two heads learning the same thing don't reduce loss, so backprop pushes them to diversify.
GPT-3: 96 heads per layer, each 128-dim. Concatenate all outputs, project back to model dimension via W_O.
GQA shares K/V across groups of heads to reduce KV cache — the modern optimization.
Heads can form cross-layer circuits: induction heads are two heads in consecutive layers that coordinate — one finds matching prefixes, the other copies what followed.