ML//Transformer//attention
Attention existed before the 2017 paper. The key contribution of "Attention Is All You Need" was that attention was... all you needed — no recurrence. Self-attention specifically was novel.
Attention existed before the 2017 paper. The key contribution of "Attention Is All You Need" was that attention was... all you needed — no recurrence. Self-attention specifically was novel.
Attention does context routing: allows encoded information from one token's vector to flow into another's. Every token can look at every other token in parallel.
Q, K, V matrices: query asks a question, keys answer relevance, values carry the content.
O(n²) cost in sequence length — the fundamental scaling constraint. Doblar el contexto = 4x memoria y cómputo.
~1/3 of all you need — attention handles context, the other 2/3 is MLP storing facts.
Replaced the sequential bottleneck of RNNs with parallel computation.