ML//Transformer//forward pass

One complete pass through the network: embeddings + positional encoding → { Attention → MLP } × N layers → layer norm → LM head → logits → softmax → token.


The residual stream carries information through all layers: each sublayer (attention or MLP) adds its output to the stream via a residual connection. The final hidden state is the stream's value after the last block.
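A minimal NumPy sketch of the pipeline and the residual stream described above. Everything here is an assumption for illustration: the toy sizes, the random untrained weights, the pre-layer-norm placement (GPT-2 style), ReLU in place of GELU, and layer norm without learned gain/bias. It is not any particular model's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
V, T_MAX, D, H, N = 50, 16, 32, 4, 2  # vocab, max positions, width, heads, layers (toy sizes)

def init(shape):
    # Random untrained weights; a real model would load trained parameters.
    return rng.normal(0.0, 0.02, shape)

def layer_norm(x, eps=1e-5):
    # Normalize each position's vector (learned gain/bias omitted for brevity).
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def attention(x, p):
    T = x.shape[0]
    split = lambda a: a.reshape(T, H, D // H).transpose(1, 0, 2)  # (H, T, D/H)
    q, k, v = split(x @ p["wq"]), split(x @ p["wk"]), split(x @ p["wv"])
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(D // H)           # (H, T, T)
    future = np.triu(np.ones((T, T), dtype=bool), k=1)            # True above the diagonal
    scores = np.where(future, -np.inf, scores)                    # causal mask
    out = (softmax(scores) @ v).transpose(1, 0, 2).reshape(T, D)  # merge heads
    return out @ p["wo"]

def mlp(x, p):
    return np.maximum(x @ p["w1"], 0.0) @ p["w2"]  # ReLU as a stand-in for GELU

params = {
    "tok_emb": init((V, D)),
    "pos_emb": init((T_MAX, D)),
    "blocks": [
        {"wq": init((D, D)), "wk": init((D, D)), "wv": init((D, D)), "wo": init((D, D)),
         "w1": init((D, 4 * D)), "w2": init((4 * D, D))}
        for _ in range(N)
    ],
    "lm_head": init((D, V)),
}

def forward(tokens, params):
    T = len(tokens)
    x = params["tok_emb"][tokens] + params["pos_emb"][:T]  # residual stream, shape (T, D)
    for p in params["blocks"]:
        x = x + attention(layer_norm(x), p)  # attention adds its delta to the stream
        x = x + mlp(layer_norm(x), p)        # MLP adds its delta to the stream
    x = layer_norm(x)                        # final layer norm
    return x @ params["lm_head"]             # logits, shape (T, V)

tokens = np.array([3, 14, 15, 9])            # made-up token ids
logits = forward(tokens, params)
probs = softmax(logits[-1])                  # distribution over the next token
next_token = int(np.argmax(probs))           # greedy pick; sampling is also common
```

Note that `x` is never overwritten by a sublayer, only added to, which is what lets the stream carry information from the embeddings all the way to the LM head.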

In inference: 1 forward pass = 1 new token predicted (one decoding step).

In training: the full sequence is processed in parallel. The causal mask ensures each position attends only to itself and earlier positions, so a sequence of T tokens yields T-1 next-token training examples in one pass.
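A small sketch of the two ideas above: the shift that turns T tokens into T-1 (input, target) pairs, and a check that the causal mask stops information flowing backwards. The token ids and hidden states are made up, and the attention is simplified to a single head with no learned projections (queries, keys and values are all `x`).

```python
import numpy as np

tokens = np.array([11, 42, 7, 99, 3])        # T = 5 made-up token ids
inputs, targets = tokens[:-1], tokens[1:]    # T-1 pairs: position i predicts token i+1

T_in = len(inputs)
# allowed[i, j] is True when position i may attend to position j (j <= i).
allowed = np.tril(np.ones((T_in, T_in), dtype=bool))

def causal_attention(x):
    # Simplified single-head attention: q = k = v = x (an assumption for brevity).
    scores = x @ x.T / np.sqrt(x.shape[-1])
    scores = np.where(allowed, scores, -np.inf)          # hide future positions
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)
    return weights @ x

rng = np.random.default_rng(0)
x = rng.normal(size=(T_in, 8))               # stand-in hidden states, one row per position
out = causal_attention(x)

# Perturb only the last position: earlier outputs must be unchanged.
x2 = x.copy()
x2[-1] += 10.0
out2 = causal_attention(x2)
assert np.allclose(out[:-1], out2[:-1])

for i in range(T_in):
    print(f"context {inputs[: i + 1].tolist()} -> target {int(targets[i])}")
```

Because every position's output depends only on positions at or before it, all T-1 predictions can be scored against their targets in the same pass.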

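Starts from scratch every time. Hidden states are NOT carried over from the previous pass: only the newly generated token is appended to the input, which is then re-embedded. Carrying state over would be reinventing RNNs.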
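The inference loop can be sketched as below. `toy_forward` is a hypothetical stand-in for a real forward pass (its scoring rule is invented purely so the example is deterministic). The point is only the shape of the loop: one call per new token, and nothing but token ids passed from one step to the next.

```python
def toy_forward(tokens, vocab_size=10):
    """Hypothetical stand-in for a forward pass: returns next-token logits.

    It is a pure function of the token ids; no state survives between calls.
    """
    logits = [0.0] * vocab_size
    logits[(sum(tokens) + 1) % vocab_size] = 1.0  # made-up rule, not a real model
    return logits

def generate(prompt, steps):
    tokens = list(prompt)
    passes = 0
    for _ in range(steps):
        logits = toy_forward(tokens)  # full pass over the whole sequence so far
        passes += 1
        next_token = max(range(len(logits)), key=logits.__getitem__)  # greedy pick
        tokens.append(next_token)     # only the token id is fed back
    return tokens, passes

tokens, passes = generate([1, 2], steps=3)
print(tokens, passes)  # [1, 2, 4, 8, 6] 3
```

Three new tokens cost three forward passes, each over a sequence one token longer than the last.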

Standard pattern: [Attention → MLP] × N layers. GPT-3 (175B): N=96. GPT-2 small: N=12. Each block wraps both sublayers in residual connections and applies layer norm.

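Path through latent space: alternating context (attention, which mixes information across positions) and fact (MLP, which acts on each position independently and is often described as recalling stored associations): 1 context, 1 fact, 1 context, 1 fact.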