ML//Transformer//attention//head

One unit of multi-head attention: its own W_Q, W_K, W_V projections operating on a reduced dimension (e.g. 128-dim instead of 12288)

Why reduce? Each head can specialize in a different type of relationship (syntax, coreference, semantic, positional) by operating in its own subspace.

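No explicit mechanism forces specialization; it emerges from training. A head that duplicates another adds little to the loss reduction, so gradient descent tends to push heads toward different features. The pressure is soft, though: trained models still contain some redundant heads.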

Emergent specialization

Empirically observed specializations: syntax tracking, coreference resolution, positional distance, and induction heads (pattern-copying circuits for in-context learning)

GPT-3: 96 heads per layer × 96 layers. Outputs are concatenated and projected back to model dimension via W_O.
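A minimal NumPy sketch of the concatenate-and-project step, assuming toy sizes (8 heads, d_model = 32) rather than GPT-3's. The function name `multi_head_attention`, the stacked weight layout, and the 1/sqrt(d_head) score scaling are illustrative assumptions, not taken from this note. The causal mask is left out for brevity.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V, W_O):
    """X: (n_tokens, d_model). W_Q, W_K, W_V: (n_heads, d_model, d_head).
    W_O: (n_heads * d_head, d_model). Hypothetical layout for illustration."""
    head_outputs = []
    for W_q, W_k, W_v in zip(W_Q, W_K, W_V):
        # Each head projects into its own d_head-dimensional subspace.
        Q, K, V = X @ W_q, X @ W_k, X @ W_v
        weights = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))
        head_outputs.append(weights @ V)          # (n_tokens, d_head)
    concat = np.concatenate(head_outputs, axis=-1)  # (n_tokens, n_heads * d_head)
    return concat @ W_O                             # back to (n_tokens, d_model)

# GPT-3 sizes from the note: 96 heads of 128 dims concatenate to 12288.
assert 96 * 128 == 12288

rng = np.random.default_rng(0)
n_tokens, d_model, n_heads = 4, 32, 8
d_head = d_model // n_heads
X = rng.normal(size=(n_tokens, d_model))
W_Q = rng.normal(size=(n_heads, d_model, d_head))
W_K = rng.normal(size=(n_heads, d_model, d_head))
W_V = rng.normal(size=(n_heads, d_model, d_head))
W_O = rng.normal(size=(n_heads * d_head, d_model))

Y = multi_head_attention(X, W_Q, W_K, W_V, W_O)
print(Y.shape)  # (4, 32)
```

The output has the same shape as the input, which is what lets the result be added back onto each token's vector.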

The full process (Q·K scores → softmax → weight the value vectors → sum them per token → suggested change to that token's vector) IS one head of attention.
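The pipeline can be sketched for a single head in NumPy. This is a sketch under assumptions: the name `attention_head`, the toy sizes, the 1/sqrt(d_head) scaling, and the causal mask are not stated in this note (the last two are standard in GPT-style models). Tokens are rows here, so the softmax and the weighted sum run along rows; a columns-as-tokens layout would transpose this. The result is in the head's reduced dimension, before W_O maps it back to model dimension.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_head(X, W_Q, W_K, W_V, causal=True):
    """One head. X: (n_tokens, d_model); W_Q, W_K, W_V: (d_model, d_head)."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    # Q·K: how strongly each token (row) attends to each other token (column).
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    if causal:
        # Assumed GPT-style mask: a token may not attend to later tokens.
        n = scores.shape[0]
        future = np.triu(np.ones((n, n), dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    # softmax: turn each row of scores into weights that sum to 1.
    weights = softmax(scores, axis=-1)
    # ×V and sum: each token's output is a weighted sum of value vectors.
    return weights @ V, weights

rng = np.random.default_rng(0)
n_tokens, d_model, d_head = 5, 16, 4
X = rng.normal(size=(n_tokens, d_model))
W_Q = rng.normal(size=(d_model, d_head))
W_K = rng.normal(size=(d_model, d_head))
W_V = rng.normal(size=(d_model, d_head))

out, weights = attention_head(X, W_Q, W_K, W_V)
print(out.shape)      # (5, 4)
print(weights.shape)  # (5, 5)
```

Inspecting `weights` for a trained head is the usual starting point for asking what that head attends to.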

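The question mechanistic interpretability asks: can you identify what each head does across 96 heads × 96 layers? Doing so would give a large part of the map of how the model computes, though not a complete one, since the MLP layers that follow each attention layer also do much of the work.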