ML//Transformer//attention//self-attention
Q, K, V all come from the same sequence — each token attends to all others including itself.
Q, K, V all come from the same sequence — each token attends to all others including itself.
The novel contribution of the 2017 paper — attention existed before, but self-attention for sequence processing was new.
The n² cost that defines transformer scaling: every pair of tokens interacts.
softmax(QKᵀ/√d) V — the dot product measures relevance, softmax normalizes, values get weighted.
In encoder (BERT): bidirectional, no masking — every token sees everything.
In decoder (GPT): causal, with masking — each token only sees the past.