ML//Transformer//attention//induction head
Induction heads are one of the most important discoveries of mechanistic interpretability: concrete, verifiable, and functionally understood.
The pattern they implement: if [A][B] appears in the context and [A] later appears again, the model predicts [B] with high probability. "The cat sleeps. The dog barks. The cat ____" → predicts "sleeps".
This IS in-context learning at the lowest level — the model learns patterns within the current sequence, no weight updates.
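As a rough illustration of the [A][B] ... [A] → [B] rule itself (not of how a transformer computes it), the behavior can be written as a plain lookup over the context. The function name `induction_predict` and the whitespace tokenization are assumptions made for this sketch.

```python
def induction_predict(tokens):
    """Apply the [A][B] ... [A] -> [B] rule to a list of tokens.

    Returns the token that followed the most recent earlier occurrence
    of the last token, or None if the last token has not appeared before.
    """
    if not tokens:
        return None
    current = tokens[-1]
    # Scan backwards over earlier positions, skipping the last position itself.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]
    return None


# Hypothetical tokenization: split on whitespace, punctuation as its own token.
context = "The cat sleeps . The dog barks . The cat".split()
prediction = induction_predict(context)
print(prediction)
```

Nothing here is stored between calls: the "learning" lives entirely in the context passed in, which is the sense in which no weights are updated.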
The circuit
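Two attention heads in different layers working together (the first must sit in an earlier layer than the second):
Head 1 (previous-token head): each position attends to the token just before it and records that token's identity.
Head 2 (induction head): the current token uses that record to find the position right after an earlier occurrence of itself (prefix matching), then copies the token found there as the prediction.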
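A minimal sketch of how two such heads could compose, assuming hand-written matching rules in place of learned W_Q, W_K, W_V matrices. The earlier head records at each position which token preceded it; the later head uses that record to attend to the position right after an earlier copy of the current token and copies the token there. The function names and the `sharpness` parameter are hypothetical.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def previous_token_head(tokens):
    # Earlier-layer head: every position stores the identity of the token
    # just before it (position 0 has no predecessor, so it stores None).
    return [None] + tokens[:-1]


def induction_head(tokens, prev_info, sharpness=10.0):
    """Later-layer head, evaluated at the last position only.

    Query: the current token. Key: the "previous token" record written by
    the earlier head. Value: the token at the attended position.
    """
    query = tokens[-1]
    # High score wherever the recorded previous token equals the query.
    scores = [sharpness if prev == query else 0.0 for prev in prev_info]
    weights = softmax(scores)
    # Copy step: each attended token receives its share of attention weight.
    votes = {}
    for tok, w in zip(tokens, weights):
        votes[tok] = votes.get(tok, 0.0) + w
    return weights, votes


tokens = "The cat sleeps . The dog barks . The cat".split()
prev_info = previous_token_head(tokens)
weights, votes = induction_head(tokens, prev_info)
predicted = max(votes, key=votes.get)
print(predicted)
```

In a trained model the matching is soft and happens in a learned vector space; the exact-equality test here only stands in for that.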
Not a special architecture: they're regular heads with regular W_Q, W_K, W_V matrices, initialized randomly. Gradient descent pushed them into this specialization because it reduced the loss.
In large models there are multiple induction circuits, some more specialized and some for different pattern types. They're a subset of the heads in each layer (e.g., 96 per layer in GPT-3 175B), not all of them.
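One way to tell which heads behave like induction heads, sketched here as an assumption rather than something these notes specify, is to feed a random block of tokens repeated twice and measure how much attention each position in the second copy puts on the token right after its earlier match. The attention patterns below are synthetic stand-ins, and `induction_score` is a hypothetical helper name.

```python
def induction_score(attn, period):
    """Mean attention from each second-copy position i to i - period + 1.

    attn[i][j] is the attention weight from query position i to key
    position j, for a sequence made of one block of length `period`
    repeated twice.
    """
    queries = range(period, 2 * period)
    return sum(attn[i][i - period + 1] for i in queries) / period


def ideal_induction_pattern(period):
    n = 2 * period
    attn = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # In the second copy, attend to the token after the earlier match;
        # before that nothing can match, so rest on position 0.
        target = i - period + 1 if i >= period else 0
        attn[i][target] = 1.0
    return attn


def uniform_causal_pattern(period):
    n = 2 * period
    # Each position spreads attention evenly over itself and earlier positions.
    return [[1.0 / (i + 1) if j <= i else 0.0 for j in range(n)] for i in range(n)]


period = 4
score_ideal = induction_score(ideal_induction_pattern(period), period)
score_uniform = induction_score(uniform_causal_pattern(period), period)
print(score_ideal, score_uniform)
```

With a real model, `attn` would come from one head's attention weights on such a repeated sequence; heads scoring near 1 are candidates, heads near the uniform baseline are not.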
During training there's a phase transition: when induction heads emerge, the loss drops abruptly — the model suddenly gets much better at following context patterns.
The "hello world" of understanding transformers from the inside — proof that attention heads form interpretable circuits, not just opaque weights.