ML//Transformer//LM head

2026-02-15

The final layer: it converts the final residual stream vector into logits over the vocabulary.

Pipeline: residual stream [768 dims] → LM head (W_U, which equals W_E transposed under weight tying) → logits [50K vocab] → softmax → probabilities → sampling or argmax → token.
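A minimal NumPy sketch of this pipeline, assuming toy sizes (d_model = 8, vocab = 20 instead of 768 and 50K), random weights, tied embeddings, and greedy argmax decoding. The names `lm_head` and `softmax` are illustrative helpers, not a library API.

```python
import numpy as np

def softmax(logits):
    # Subtract the max for numerical stability; the result sums to 1.
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def lm_head(h, W_E):
    # h:   final hidden state, shape [d_model]
    # W_E: embedding matrix, shape [vocab, d_model] (tied weights assumed)
    return h @ W_E.T  # logits, shape [vocab]

rng = np.random.default_rng(0)
d_model, vocab = 8, 20                   # toy sizes (assumption)
W_E = rng.normal(size=(vocab, d_model))  # random stand-in for trained weights
h = rng.normal(size=d_model)             # stand-in for the final hidden state

logits = lm_head(h, W_E)
probs = softmax(logits)
next_token = int(np.argmax(probs))       # greedy decoding (assumption)
```

Softmax is monotonic, so greedy decoding picks the same token whether argmax is taken over the logits or over the probabilities; the softmax only matters when sampling.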

The input is called the final hidden state (or last hidden state): the residual stream vector after all N layers, before this projection.

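With weight tying, W_U is the embedding matrix transposed (W_U = W_E^T): the same parameters that encode tokens also decode the output. This only holds for models that tie the two matrices; models without weight tying learn a separate W_U.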
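A small NumPy sketch of what "the same parameters" means here, assuming toy sizes and random weights. In NumPy a transpose is a view of the same memory, which stands in for parameter sharing in a real framework.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, d_model = 20, 8                   # toy sizes (assumption)
W_E = rng.normal(size=(vocab, d_model))  # embedding: token id -> vector
W_U = W_E.T                              # unembedding: a view, not a copy

token_id = 3
embedded = W_E[token_id]                 # encode: row lookup
logits = embedded @ W_U                  # decode with the same parameters

tied = np.shares_memory(W_E, W_U)        # both names point at one buffer
W_E[token_id, 0] += 1.0                  # an update to one is seen by the other
```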

Each row of W_E (equivalently, each column of W_U) represents a token in the vocabulary: the dot product between the final hidden state and that row gives the token's logit, i.e. "how aligned is this representation with each possible next token".
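The dot-product reading can be checked directly. This sketch assumes toy sizes, random weights, and rows normalized to unit length (a simplification for the illustration only; trained embeddings are generally not unit-norm, so a row's norm also affects its logit).

```python
import numpy as np

rng = np.random.default_rng(1)
vocab, d_model = 5, 4                            # toy sizes (assumption)
W_E = rng.normal(size=(vocab, d_model))
W_E /= np.linalg.norm(W_E, axis=1, keepdims=True)  # unit-norm rows (assumption)
h = rng.normal(size=d_model)

# One dot product per token row...
per_token = np.array([np.dot(h, W_E[t]) for t in range(vocab)])
# ...is the same computation as a single matrix-vector product.
logits = W_E @ h
same = np.allclose(per_token, logits)

# A hidden state pointing exactly along token 2's row scores token 2 highest.
h_aligned = W_E[2]
best = int(np.argmax(W_E @ h_aligned))
```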

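W_U has shape [d_model, vocab]: its input side matches the embedding (residual stream) width, and its output side matches the tokenizer's vocabulary size. Because d_model is much smaller than the vocabulary, every logit vector lies in a subspace of at most d_model dimensions; the embedding width is the bottleneck.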
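One way to see the bottleneck numerically: stack many logit vectors and check the rank of the resulting matrix. This is a sketch with toy sizes and random weights (all values are assumptions); the rank can never exceed d_model, however large the vocabulary is.

```python
import numpy as np

rng = np.random.default_rng(2)
d_model, vocab, n = 4, 50, 200           # toy sizes (assumption)
W_U = rng.normal(size=(d_model, vocab))  # unembedding, shape [d_model, vocab]
H = rng.normal(size=(n, d_model))        # n different final hidden states

L = H @ W_U                              # logits for all n states, [n, vocab]
rank = int(np.linalg.matrix_rank(L))     # bounded by d_model, not by vocab
```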