ML//Transformer//attention//head
One unit of multi-head attention: its own W_Q, W_K, W_V projections operating on a reduced dimension (e.g. 128-dim instead of 12288)
One unit of multi-head attention: its own W_Q, W_K, W_V projections operating on a reduced dimension (e.g. 128-dim instead of 12288)
Why reduce? Each head can specialize in a different type of relationship (syntax, coreference, semantic, positional) by operating in its own subspace.
No explicit mechanism forces specialization — it emerges from training. If two heads learn the same thing, they don't reduce loss, so backprop pushes them to diversify.
Emergent specialization
Empirically observed specializations: syntax tracking, coreference resolution, positional distance, and induction heads (pattern-copying circuits for in-context learning)
GPT-3: 96 heads per layer × 96 layers. Outputs are concatenated and projected back to model dimension via W_O.
The full process (Q·K → softmax → ×V → sum columns → suggested vector change) IS one head of attention.
The question mechanistic interpretability asks: if you can identify what each head does across 96 heads × 96 layers, you have a complete map of how the model thinks.