ML//Transformer//attention//dot product

The mechanism that measures relevance: Q · Kᵀ produces an N×N matrix of alignment scores between every pair of tokens.


The mechanism that measures relevance: Q · Kᵀ produces an N×N matrix of alignment scores between every pair of tokens.

Raw scores look terrible — arbitrarily large, no normalization. Scaled by √d_k to prevent softmax saturation.

Softmax applied per column (each query's distribution over all keys) → probabilities that sum to 1.

Multiply the softmax weights by V → weighted sum of values = the "suggested change" to each token's representation.

Full formula: softmax(QKᵀ/√d_k) · V — this whole process is one attention head

The attention matrix is what makes transformers O(n²) in sequence length — for N=4096, that's 16M values per head.