ML//Inference

At inference time, only the last token's hidden state matters — attention has encoded all previous context into a single 1×D vector.


At inference time, only the last token's hidden state matters — attention has encoded all previous context into a single 1×D vector.

That vector hits the LM Head, producing logits over the vocabulary.

sampling picks the actual next token from those logits.

Auto-regressive: each generated token becomes input for the next — the model eats its own output.