ML//Transformer//forward pass

2026-03-03

One complete pass through the network: token embeddings + positional encoding → [Attention → MLP] × N layers → final layer norm → LM head → logits → softmax → next token (sampled or argmax).
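A minimal sketch of that pipeline in plain Python, to make the order of operations concrete. Everything here is an assumption for illustration: the toy sizes (V, D, N, T_MAX), random untrained weights, learned positional embeddings, single-head attention, ReLU in the MLP, layer norm without learned scale/bias, pre-norm placement, and greedy (argmax) token choice.

```python
import math
import random

random.seed(0)
V, D, N, T_MAX = 10, 8, 2, 16  # toy vocab size, model width, layer count, max positions

def rand_matrix(rows, cols, scale=0.1):
    return [[random.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]

def matvec(vec, mat):
    """Row vector times matrix: len(vec) == len(mat); result has len(mat[0]) entries."""
    return [sum(v * row[j] for v, row in zip(vec, mat)) for j in range(len(mat[0]))]

def add(a, b):
    return [x + y for x, y in zip(a, b)]

def layer_norm(x, eps=1e-5):
    # Simplified: no learned scale or bias.
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def softmax(x):
    m = max(x)
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]

def attention(xs, blk):
    """Single-head causal self-attention over a list of per-position vectors."""
    qs = [matvec(x, blk["Wq"]) for x in xs]
    ks = [matvec(x, blk["Wk"]) for x in xs]
    vs = [matvec(x, blk["Wv"]) for x in xs]
    out = []
    for t, q in enumerate(qs):
        # Causal: position t only scores positions 0..t.
        scores = [sum(a * b for a, b in zip(q, ks[s])) / math.sqrt(D) for s in range(t + 1)]
        weights = softmax(scores)
        mixed = [sum(w * vs[s][j] for s, w in enumerate(weights)) for j in range(D)]
        out.append(matvec(mixed, blk["Wo"]))
    return out

def mlp(x, blk):
    hidden = [max(0.0, h) for h in matvec(x, blk["W1"])]  # ReLU for simplicity
    return matvec(hidden, blk["W2"])

def make_block():
    blk = {name: rand_matrix(D, D) for name in ("Wq", "Wk", "Wv", "Wo")}
    blk["W1"] = rand_matrix(D, 4 * D)
    blk["W2"] = rand_matrix(4 * D, D)
    return blk

params = {
    "tok_emb": rand_matrix(V, D),
    "pos_emb": rand_matrix(T_MAX, D),
    "blocks": [make_block() for _ in range(N)],
    "lm_head": rand_matrix(D, V),
}

def forward(tokens):
    # Embeddings + positional encoding
    xs = [add(params["tok_emb"][tok], params["pos_emb"][t]) for t, tok in enumerate(tokens)]
    # [Attention -> MLP] x N, each sublayer added back into the stream
    for blk in params["blocks"]:
        xs = [add(x, d) for x, d in zip(xs, attention([layer_norm(x) for x in xs], blk))]
        xs = [add(x, mlp(layer_norm(x), blk)) for x in xs]
    # Final layer norm -> LM head -> logits -> softmax -> token (last position only)
    final = layer_norm(xs[-1])
    logits = matvec(final, params["lm_head"])
    probs = softmax(logits)
    next_token = max(range(V), key=lambda i: probs[i])  # greedy pick
    return logits, probs, next_token

logits, probs, next_token = forward([1, 4, 2])
print(len(logits), round(sum(probs), 6), next_token)
```

The weights are random, so the predicted token is meaningless; only the shape of the computation matters here.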

The residual stream carries information through all layers. Each block reads from the stream and adds its output back as a delta via a residual connection, so the final hidden state is the initial embedding plus the sum of all deltas.
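A small sketch of the "stream plus deltas" view. The three lambda blocks are hypothetical stand-ins for attention/MLP sublayers, not real layers; the point is only that the stream's last value equals the starting vector plus every delta added along the way.

```python
def run_stream(x0, blocks):
    """Apply each block as x <- x + block(x) and record the deltas."""
    x = list(x0)
    deltas = []
    for block in blocks:
        delta = block(x)                            # block reads the current stream
        deltas.append(delta)
        x = [xi + di for xi, di in zip(x, delta)]   # and writes its delta back
    return x, deltas

# Hypothetical toy blocks (placeholders for attention / MLP sublayers).
blocks = [
    lambda x: [0.5 * v for v in x],
    lambda x: [v + 1.0 for v in x],
    lambda x: [-0.25 * v for v in x],
]

x0 = [1.0, 2.0, 3.0]
final, deltas = run_stream(x0, blocks)

# The final hidden state is the initial vector plus the sum of all deltas.
reconstructed = [x0[i] + sum(d[i] for d in deltas) for i in range(len(x0))]
print(final, reconstructed)
```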

At inference: 1 forward pass = 1 token predicted (one decoding step).

In training: one pass processes the full sequence in parallel. The causal mask ensures each position sees only itself and earlier tokens, so a sequence of T tokens yields T-1 next-token training examples in one pass.
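A sketch of where the T-1 comes from, assuming the usual next-token setup where each position's target is the token that follows it. The token ids are made up.

```python
def causal_mask(T):
    """mask[t][s] is True when position t may attend to position s (s <= t)."""
    return [[s <= t for s in range(T)] for t in range(T)]

def training_pairs(tokens):
    """Each prefix tokens[:t+1] predicts tokens[t+1]: T-1 (context, target) pairs."""
    return [(tokens[: t + 1], tokens[t + 1]) for t in range(len(tokens) - 1)]

tokens = [7, 3, 9, 4]  # made-up token ids, T = 4
mask = causal_mask(len(tokens))
pairs = training_pairs(tokens)

for row in mask:
    print(["x" if allowed else "." for allowed in row])
for context, target in pairs:
    print(context, "->", target)
```

The last token has no successor inside the sequence, which is why there are T-1 pairs rather than T. In a real pass all pairs are scored at once; the mask is what keeps each position from looking at its own target.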

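Each pass starts from scratch: the input is always the token sequence. Embeddings (hidden states) are NOT carried over from the previous pass; only the chosen token is appended to the input. Feeding hidden states back in would be reinventing RNNs.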
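A sketch of the decoding loop this implies: one forward pass per generated token, with the token list as the only thing handed from one pass to the next. `toy_forward` is a hypothetical stand-in for a real model (an arbitrary deterministic rule, not a network), and greedy selection is assumed.

```python
VOCAB = 5

def toy_forward(tokens):
    """Hypothetical stand-in for a forward pass: tokens in, next-token logits out.
    It is a pure function of the token list; no state survives between calls."""
    logits = [0.0] * VOCAB
    logits[(sum(tokens) + len(tokens)) % VOCAB] = 1.0  # arbitrary reproducible rule
    return logits

def generate(prompt, steps):
    tokens = list(prompt)
    passes = 0
    for _ in range(steps):
        logits = toy_forward(tokens)   # one pass over the whole sequence so far
        passes += 1
        next_token = max(range(VOCAB), key=lambda i: logits[i])  # greedy pick
        tokens.append(next_token)      # only the token is carried forward
    return tokens, passes

tokens, passes = generate([1, 2], steps=4)
print(tokens, passes)
```

Real implementations commonly cache attention keys and values to avoid recomputing earlier positions, but that is an optimization and does not change the result.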

Standard pattern: [Attention → MLP] × N layers. GPT-3 (175B): N=96. GPT-2 small: N=12. Each block has residual connections and layer norm.

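Path through latent space: alternating context (attention, which mixes information across positions) and fact (MLP, which acts on each position on its own): 1 context, 1 fact, 1 context, 1 fact. This is a rough mental model, not a strict division of labor.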