ML//autoregressive

2026-02-25

The generation paradigm behind GPT and all decoder-only models: predict the next token given all previous tokens, one at a time, left to right.
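
Written out, this is the chain-rule factorization of a sequence's probability. The notation is introduced here rather than taken from the note: $x_t$ is the token at position $t$, $x_{<t}$ is everything before it, and $T$ is the sequence length.

```latex
p(x_1, \dots, x_T) = \prod_{t=1}^{T} p(x_t \mid x_{<t})
```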

This is why causal masking exists. If the model could see future tokens during training, it would cheat by reading the answer straight from its input. Masking forces it to learn genuine left-to-right prediction, matching how it must generate at inference time.
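
A minimal NumPy sketch of the idea, assuming single-head scaled dot-product attention. The function name `causal_attention` and the random inputs are made up for illustration. It checks the property that matters: changing a later token leaves every earlier output untouched.

```python
import numpy as np

def causal_attention(q, k, v):
    """Single-head scaled dot-product attention with a causal mask.

    q, k, v: arrays of shape (seq_len, d).
    Returns (outputs, attention_weights).
    """
    seq_len, d = q.shape
    scores = q @ k.T / np.sqrt(d)                      # (seq_len, seq_len)
    # True above the diagonal: position i must not attend to any j > i.
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    # Row-wise softmax; masked entries end up with exactly zero weight.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))          # 5 hypothetical token vectors
out, weights = causal_attention(x, x, x)

# Replace the last token and recompute: positions 0..3 should not move.
x_changed = x.copy()
x_changed[4] = rng.normal(size=8)
out_changed, _ = causal_attention(x_changed, x_changed, x_changed)
```

Here `out[:4]` and `out_changed[:4]` agree, while the last row differs, because only the last position is allowed to see the token that changed.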

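The KV cache optimization is a direct consequence: since each token's representation depends only on itself and the tokens before it, the keys and values already computed for earlier tokens are "closed" and can be cached instead of recomputed at every step. Bidirectional models can't do this, because appending a token changes the representation of every position.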
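
A rough sketch of why the cache is safe, assuming a single attention layer with made-up projection matrices (`w_q`, `w_k`, `w_v`) and random vectors standing in for token embeddings. Recomputing the whole prefix at each step and appending one key/value pair per step give the same outputs. In a multi-layer model the same argument applies layer by layer, since causal masking keeps each layer's earlier inputs fixed as well.

```python
import numpy as np

def attend(q, keys, values):
    """Attention output for one query vector over the given keys and values."""
    scores = keys @ q / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max())
    weights = weights / weights.sum()
    return weights @ values

rng = np.random.default_rng(0)
d = 8
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))  # hypothetical weights
tokens = rng.normal(size=(6, d))                               # hypothetical embeddings

# Without a cache: at every step, re-project the entire prefix.
no_cache = []
for t in range(len(tokens)):
    prefix = tokens[: t + 1]
    no_cache.append(attend(prefix[-1] @ w_q, prefix @ w_k, prefix @ w_v))

# With a cache: project only the new token, then append its key and value.
k_cache, v_cache, cached = [], [], []
for x in tokens:
    k_cache.append(x @ w_k)
    v_cache.append(x @ w_v)
    cached.append(attend(x @ w_q, np.stack(k_cache), np.stack(v_cache)))
```

The cached version does one projection per step instead of one per prefix token, at the cost of memory that grows with sequence length.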

Generation loop: run a forward pass to get logits from the LM head, sample the next token, append it to the sequence, repeat. Each iteration is one decoding step.
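
The loop can be sketched as follows. `toy_logits` is a hypothetical stand-in for a real forward pass plus LM head. The vocabulary size, the end-of-sequence id, the temperature parameter, and the early stop on end-of-sequence are all assumptions added for the example.

```python
import numpy as np

VOCAB_SIZE = 10
EOS_ID = 0  # assumed end-of-sequence token id

def toy_logits(token_ids):
    """Hypothetical stand-in for a forward pass plus LM head.

    Returns next-token logits of shape (VOCAB_SIZE,). A real model would
    condition on the whole sequence; this toy only looks at the last token.
    """
    logits = np.zeros(VOCAB_SIZE)
    logits[(token_ids[-1] + 1) % VOCAB_SIZE] = 5.0  # favor "last token + 1"
    return logits

def generate(prompt_ids, max_new_tokens, temperature=1.0, seed=0):
    rng = np.random.default_rng(seed)
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):             # one iteration = one decoding step
        logits = toy_logits(ids) / temperature  # forward pass
        probs = np.exp(logits - logits.max())   # softmax over the vocabulary
        probs /= probs.sum()
        next_id = int(rng.choice(VOCAB_SIZE, p=probs))  # sample next token
        ids.append(next_id)                     # append to the sequence
        if next_id == EOS_ID:
            break
    return ids

sample = generate([3], max_new_tokens=8)
```

Lowering `temperature` toward zero makes sampling approach greedy decoding, where the highest-logit token is always picked.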

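The fundamental bottleneck: generation is sequential. You can't parallelize across tokens because each depends on the previous. Speculative decoding tries to work around this with a draft-then-verify approach: a cheap draft model proposes several tokens, and the large model checks them all in a single parallel pass.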
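
A toy sketch of draft-then-verify, restricted to greedy decoding. The sampling variant uses a probabilistic accept/reject rule that is not shown here. Both "models" (`target_next`, `draft_next`) are invented arithmetic functions, and the verification loop stands in for what would be one batched forward pass of the large model.

```python
def target_next(ids):
    """Hypothetical expensive model: its greedy next token."""
    return (ids[-1] * 3 + 1) % 7

def draft_next(ids):
    """Hypothetical cheap model: agrees with the target except after a 4."""
    return 0 if ids[-1] == 4 else target_next(ids)

def speculative_generate(prompt, num_tokens, k=4):
    ids = list(prompt)
    target_len = len(prompt) + num_tokens
    verify_passes = 0
    while len(ids) < target_len:
        # Draft: the cheap model proposes k tokens, one after another.
        proposal = []
        for _ in range(k):
            proposal.append(draft_next(ids + proposal))
        # Verify: a real system scores all k positions with the target in
        # one parallel pass; this toy loops over them instead.
        verify_passes += 1
        accepted = []
        for tok in proposal:
            expected = target_next(ids + accepted)
            if tok != expected:
                accepted.append(expected)  # keep the target's token, drop the rest
                break
            accepted.append(tok)
        else:
            # Every draft token matched, so the same pass yields a bonus token.
            accepted.append(target_next(ids + accepted))
        ids.extend(accepted)
    return ids[:target_len], verify_passes

def greedy_generate(prompt, num_tokens):
    ids = list(prompt)
    for _ in range(num_tokens):
        ids.append(target_next(ids))
    return ids

spec_ids, passes = speculative_generate([1], num_tokens=12)
```

The output matches plain greedy decoding with the target model, but uses fewer target passes whenever the draft guesses well. Every pass yields at least one token, so the worst case is no slower in pass count than ordinary decoding.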

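Training is parallel (all positions computed at once via masking), but inference is serial. This asymmetry is why training is fast per token but generation is slow: one training pass yields a prediction at every position, while each generated token needs a forward pass of its own.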
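
The parallel side can be illustrated with the usual shift-by-one loss: one set of logits covers every position, and each position's target is simply the next token. This is a sketch. The function name `next_token_loss` is made up, and the all-zero logits stand in for the output of a hypothetical model.

```python
import numpy as np

def next_token_loss(logits, token_ids):
    """Mean next-token cross-entropy over all positions, computed in one shot.

    logits: (seq_len, vocab_size); row t is the prediction made after seeing
            tokens 0..t, which is what causal masking guarantees.
    token_ids: (seq_len,) integer array holding the training sequence.
    """
    preds = logits[:-1]        # the last position has no next token to predict
    targets = token_ids[1:]    # target at position t is the token at t + 1
    shifted = preds - preds.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

token_ids = np.array([2, 0, 3, 1])
uniform_logits = np.zeros((4, 4))   # stand-in for a model that knows nothing
loss = next_token_loss(uniform_logits, token_ids)
```

With uniform logits over a vocabulary of 4, the loss equals `log(4)`, the cost of guessing blindly. No loop over positions is needed, unlike at generation time.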

BERT is the counterexample: it's NOT autoregressive. It sees all tokens simultaneously (bidirectional), which is why it tends to do well on understanding tasks like classification but can't generate text naturally.