ML//Transformer//attention//multi-head attention

2018-08-20

Run self-attention multiple times in parallel with different learned projections: each head operates on a reduced subspace.

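Why reduce dimension? With d_head = d_model / n_heads, all heads together cost about the same as one full-width head, and each head can specialize: syntax, coreference, positional patterns, semantic similarity, induction (pattern-copying). Different types of relationships live in different subspaces.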

No explicit mechanism forces this: specialization emerges from training. A head that duplicates another adds little loss reduction, so gradients tend to push heads toward different roles, though some redundancy usually remains.

GPT-3 (175B): 96 heads per layer, each 128-dim, so 96 × 128 = 12288 = d_model. Concatenate all head outputs and project back to the model dimension via W_O.
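A minimal NumPy sketch of the split, attend, concatenate, project sequence described above. It assumes a single unbatched sequence, no causal mask, no biases, and toy sizes (d_model = 16, 4 heads); the function and variable names are illustrative and not taken from any particular library.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, W_q, W_k, W_v, W_o, n_heads):
    """x: (seq, d_model); every weight matrix: (d_model, d_model)."""
    seq, d_model = x.shape
    assert d_model % n_heads == 0
    d_head = d_model // n_heads

    def split(t):
        # (seq, d_model) -> (n_heads, seq, d_head)
        return t.reshape(seq, n_heads, d_head).transpose(1, 0, 2)

    q, k, v = split(x @ W_q), split(x @ W_k), split(x @ W_v)

    # Scaled dot-product attention, computed independently for each head.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (n_heads, seq, seq)
    weights = softmax(scores, axis=-1)
    heads = weights @ v                                  # (n_heads, seq, d_head)

    # Concatenate heads back to (seq, d_model), then mix them with W_O.
    concat = heads.transpose(1, 0, 2).reshape(seq, d_model)
    return concat @ W_o, weights

rng = np.random.default_rng(0)
seq, d_model, n_heads = 5, 16, 4  # toy sizes, assumed for illustration
x = rng.normal(size=(seq, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))
out, weights = multi_head_attention(x, W_q, W_k, W_v, W_o, n_heads)
print(out.shape, weights.shape)  # (5, 16) (4, 5, 5)
```

The per-head projections are stored as column slices of one (d_model, d_model) matrix, which is a common layout choice; separate per-head matrices would be mathematically equivalent.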

GQA (grouped-query attention) shares each K/V head across a group of query heads, shrinking the KV cache by the group size. A common optimization in recent models.
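A rough back-of-the-envelope sketch of the KV cache saving. It reuses the GPT-3 head shape from above, takes 96 layers (the 175B model's depth) and fp16 values as assumptions, and picks 8 KV heads purely as a hypothetical; GPT-3 itself does not use GQA.

```python
def kv_cache_bytes_per_token(n_layers, n_kv_heads, d_head, bytes_per_value=2):
    # Factor of 2: one K vector and one V vector are cached per KV head, per layer.
    return 2 * n_layers * n_kv_heads * d_head * bytes_per_value

n_layers, n_q_heads, d_head = 96, 96, 128  # GPT-3-like shape (assumed)
n_kv_heads = 8                             # hypothetical GQA setting

mha = kv_cache_bytes_per_token(n_layers, n_q_heads, d_head)   # every head has its own K/V
gqa = kv_cache_bytes_per_token(n_layers, n_kv_heads, d_head)  # K/V shared within groups

# Query head h reads the K/V of group h // group_size.
group_size = n_q_heads // n_kv_heads
kv_head_for_query = [h // group_size for h in range(n_q_heads)]

print(mha, gqa, mha // gqa)  # 4718592 393216 12
```

Under these assumptions the cache drops from about 4.5 MiB to about 384 KiB per token, a factor equal to the group size.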

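Heads can form cross-layer circuits: an induction circuit pairs a previous-token head in an earlier layer with an induction head in a later layer. The first tags each position with the token that preceded it; the second uses that tag to attend to whatever followed an earlier occurrence of the current token and copies it forward ([A][B] ... [A] -> [B]).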
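A toy sketch of the behavior an induction circuit implements, written as plain token lookup. This is not an attention implementation: it only mimics the input/output rule "find the most recent earlier occurrence of the current token and copy what followed it". The function name is made up for illustration.

```python
def induction_predict(tokens):
    """Predict the next token by the [A][B] ... [A] -> [B] rule, or None."""
    if not tokens:
        return None
    current = tokens[-1]
    # Role of the first head: position i "knows" its predecessor tokens[i - 1].
    # Role of the second head: scan back for the latest position whose
    # predecessor matches the current token, and copy the token found there.
    for i in range(len(tokens) - 1, 0, -1):
        if tokens[i - 1] == current:
            return tokens[i]
    return None

print(induction_predict(["Mr", "Dursley", "was", "proud", ".", "Mr"]))  # Dursley
print(induction_predict(["a", "b", "c"]))                               # None
```

In a real model the match is soft (attention weights over positions) rather than an exact equality test, so this sketch captures only the idealized pattern.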