ML//Transformer
The architecture behind GPT, BERT, and every modern LLM.
Standard block: { Attention → MLP } × N layers. GPT-3: N=96. GPT-2 small: N=12. Each block has residual connections + layer norm.
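A minimal NumPy sketch of the { Attention → MLP } × N stack described above. It assumes a pre-norm layout (layer norm applied before each sublayer), a single attention head, a ReLU MLP with 4×D hidden width, and no biases or learned layer-norm parameters; the names `block`, `init_block`, and the toy sizes are illustrative, not taken from any particular model.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token's feature vector to zero mean and unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def causal_attention(x, Wq, Wk, Wv, Wo):
    # Single-head attention: each token attends only to itself and earlier tokens.
    T, D = x.shape
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(D)
    future = np.triu(np.ones((T, T), dtype=bool), k=1)  # True above the diagonal
    scores = np.where(future, -np.inf, scores)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return (weights @ v) @ Wo

def mlp(x, W1, W2):
    # Two-layer feed-forward net applied to each token independently.
    return np.maximum(x @ W1, 0.0) @ W2

def block(x, p):
    # Residual connections: each sublayer adds its output back onto x.
    x = x + causal_attention(layer_norm(x), p["Wq"], p["Wk"], p["Wv"], p["Wo"])
    x = x + mlp(layer_norm(x), p["W1"], p["W2"])
    return x

def init_block(rng, D):
    # Small random weights; the shapes are what matter in this sketch.
    s = 0.02
    return {
        "Wq": rng.normal(0, s, (D, D)), "Wk": rng.normal(0, s, (D, D)),
        "Wv": rng.normal(0, s, (D, D)), "Wo": rng.normal(0, s, (D, D)),
        "W1": rng.normal(0, s, (D, 4 * D)), "W2": rng.normal(0, s, (4 * D, D)),
    }

def run_stack(x, layers):
    # Apply the N blocks in order; the shape stays (T, D) throughout.
    for p in layers:
        x = block(x, p)
    return x

rng = np.random.default_rng(0)
T, D, N = 5, 16, 3                      # toy sizes: tokens, hidden width, layers
layers = [init_block(rng, D) for _ in range(N)]
x = rng.normal(size=(T, D))
h = run_stack(x, layers)
print(h.shape)
```

Because of the causal mask, changing the last input token leaves the hidden states of all earlier tokens unchanged.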
Parameter split per block: ~1/3 attention (context routing between tokens), ~2/3 MLP (fact storage and feature detection). The path through latent space alternates context and fact: 1 context, 1 fact, 1 context, 1 fact.
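One way to read the 1/3 vs 2/3 split is as a parameter count. The sketch below assumes four D×D attention matrices (query, key, value, output), an MLP with hidden width 4×D, and ignores biases and layer-norm parameters; `block_param_split` is a hypothetical helper name.

```python
def block_param_split(d_model, d_ff=None):
    # Assumption: the MLP hidden width is 4 * d_model unless stated otherwise.
    if d_ff is None:
        d_ff = 4 * d_model
    attn = 4 * d_model * d_model   # W_Q, W_K, W_V, W_O
    mlp = 2 * d_model * d_ff       # up-projection and down-projection
    total = attn + mlp
    return attn, mlp, attn / total, mlp / total

attn, mlp, attn_frac, mlp_frac = block_param_split(4096)
print(attn, mlp, round(attn_frac, 3), round(mlp_frac, 3))
# prints: 67108864 134217728 0.333 0.667
```

Under these assumptions the ratio is exactly 1:2 regardless of D; a different MLP width shifts it.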
D latent features per hidden state (D=4096 is typical of ~7B-parameter models; GPT-2 small uses 768, GPT-3 uses 12288). Most are abstractions that are not human-interpretable.
Context vector: the last token's hidden state at the final layer, right before the LM Head. The last token has been passed through so many layers that it encodes what it is relative to everything before it, which is enough to predict the next token.
In inference: a 1×D vector. In training: a T×D matrix (T = number of tokens). The causal attention mask lets every token's hidden state be computed in one pass, producing T-1 training examples from a sequence of length T.
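The two shapes can be sketched with NumPy as follows. The hidden states here are random stand-ins for the final layer's output (assumed to have been computed under a causal mask), and `W_head`, the token ids, and the toy sizes are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
T, D, V = 6, 8, 20                    # toy sizes: tokens, hidden width, vocabulary
tokens = rng.integers(0, V, size=T)   # made-up token ids
hidden = rng.normal(size=(T, D))      # stand-in for the final layer's T×D hidden states
W_head = rng.normal(size=(D, V))      # hypothetical LM head weights

def log_softmax(z):
    # Numerically stable log-softmax over the vocabulary axis.
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

# Inference: only the last position (1×D) is projected to vocabulary scores.
context = hidden[-1:]                 # shape (1, D)
next_token = int((context @ W_head).argmax())

# Training: every position except the last has a known next token as its label.
log_probs = log_softmax(hidden[:-1] @ W_head)   # shape (T-1, V)
targets = tokens[1:]                            # token i+1 is the label for position i
loss = -log_probs[np.arange(T - 1), targets].mean()

print(context.shape, log_probs.shape, targets.shape)
```

The last position is dropped in training only because its next token is not part of the sequence, which is where the T-1 count comes from.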