ML//Inference

2026-02-15

At inference time, only the last token's hidden state matters: attention has encoded all previous context into a single 1×D vector.

At inference time, only the last token's hidden state matters: attention has encoded all previous context into a single 1×D vector.

That vector hits the LM Head, producing logits over the vocabulary.

sampling picks the actual next token from those logits.

Auto-regressive: each generated token becomes input for the next. The model eats its own output.