ML//Inference
At inference time, only the last position's hidden state is needed to predict the next token: attention has already mixed the preceding context into that single 1×D vector.
That vector hits the LM Head, producing logits over the vocabulary.
Sampling picks the actual next token from those logits.
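A minimal NumPy sketch of these two steps, under stated assumptions: the LM head is modeled as a plain D×V matrix `W_out` with no bias, and sampling is temperature-scaled softmax sampling. The names `lm_head_logits`, `sample_token`, and `W_out` are hypothetical and chosen for illustration; real models may add top-k, top-p, or other filtering.

```python
import numpy as np

def lm_head_logits(h_last, W_out):
    """Project the last position's hidden state (D,) to vocabulary logits (V,)."""
    return h_last @ W_out

def sample_token(logits, temperature=1.0, rng=None):
    """Pick a token id from logits; temperature 0 means greedy argmax."""
    logits = np.asarray(logits, dtype=np.float64)
    if temperature == 0:
        return int(np.argmax(logits))
    if rng is None:
        rng = np.random.default_rng()
    scaled = logits / temperature
    scaled = scaled - scaled.max()      # subtract max for numerical stability
    probs = np.exp(scaled)
    probs /= probs.sum()                # softmax over the vocabulary
    return int(rng.choice(len(probs), p=probs))
```

Lower temperatures concentrate probability on the highest logits; higher temperatures flatten the distribution and make unlikely tokens more probable.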
Generation is autoregressive: each generated token is appended to the input for the next step, so the model eats its own output.
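The feedback loop can be sketched as follows. This is an illustration, not a real model: `toy_next_logits` is a hypothetical stand-in that always favors the token id one above the last one, and `generate` uses greedy argmax instead of sampling to keep the output deterministic. The `eos_id` stop condition is also an assumption.

```python
import numpy as np

def generate(next_logits_fn, prompt_ids, max_new_tokens, eos_id=None):
    """Greedy autoregressive loop: each new token is appended and fed back in."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = next_logits_fn(ids)        # logits for the next position only
        next_id = int(np.argmax(logits))    # greedy pick (sampling would go here)
        ids.append(next_id)                 # output becomes part of the input
        if eos_id is not None and next_id == eos_id:
            break
    return ids

def toy_next_logits(ids, vocab_size=5):
    """Hypothetical stand-in model: favors (last token + 1) mod vocab_size."""
    logits = np.zeros(vocab_size)
    logits[(ids[-1] + 1) % vocab_size] = 1.0
    return logits

print(generate(toy_next_logits, [0], max_new_tokens=4))  # [0, 1, 2, 3, 4]
```

A real implementation would typically cache per-layer keys and values so that each step only processes the newest token instead of recomputing the whole sequence.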