ML//Inference//decoding step
One forward pass = one new token generated. The full inference = a sequence of decoding steps.
At each step: the new token attends to all previous tokens, but previous tokens do NOT attend to the new one (causal masking). So the K, V of previous tokens never change once computed.
KV cache makes this efficient: only compute K, V for the new token, reuse cached K, V for all previous tokens.
Without KV cache: for sequence length N, every step recomputes K, V (and attention) for all N tokens. With cache: compute q, k, v only for the new token and attend over the N cached entries. For 4K+ sequences, roughly 10-50x speedup (depends on model and hardware).
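A minimal sketch of the cached vs. uncached comparison, assuming a single attention head and plain Python lists instead of a tensor library. The helper names (`matvec`, `attend`, `full_recompute`, `cached_decode`) and the random weights are made up for illustration; a real model does this per layer and per head.

```python
import math
import random

def matvec(x, W):
    """Row vector x (length d_in) times matrix W (d_in x d_out)."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(q, keys, values):
    """Attention output for one query over the given keys and values."""
    scale = math.sqrt(len(q))
    weights = softmax([dot(q, k) / scale for k in keys])
    return [sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))]

def full_recompute(X, Wq, Wk, Wv):
    """No cache: recompute K and V for the whole prefix at every step."""
    outputs = []
    for t in range(len(X)):
        prefix = X[: t + 1]                       # causal: only tokens <= t
        keys = [matvec(x, Wk) for x in prefix]    # recomputed every step
        values = [matvec(x, Wv) for x in prefix]  # recomputed every step
        outputs.append(attend(matvec(X[t], Wq), keys, values))
    return outputs

def cached_decode(X, Wq, Wk, Wv):
    """With cache: compute k, v once per token and append them to the cache."""
    k_cache, v_cache, outputs = [], [], []
    for x in X:
        k_cache.append(matvec(x, Wk))  # only the new token's k
        v_cache.append(matvec(x, Wv))  # only the new token's v
        outputs.append(attend(matvec(x, Wq), k_cache, v_cache))
    return outputs, k_cache, v_cache

def rand_matrix(rows, cols):
    return [[random.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]

random.seed(0)
d = 4
X = rand_matrix(6, d)  # 6 token embeddings (illustrative values)
Wq, Wk, Wv = rand_matrix(d, d), rand_matrix(d, d), rand_matrix(d, d)

full_out = full_recompute(X, Wq, Wk, Wv)
cached_out, k_cache, v_cache = cached_decode(X, Wq, Wk, Wv)
max_diff = max(abs(a - b)
               for row_a, row_b in zip(full_out, cached_out)
               for a, b in zip(row_a, row_b))
```

Both paths give the same outputs (`max_diff` is zero up to floating-point error). The difference is the work done: for N tokens, `full_recompute` performs N(N+1)/2 key projections, `cached_decode` performs N.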
The cache grows linearly with context length, which is part of why long context is expensive (the cache itself costs memory, not compute).
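Rough arithmetic for the cache size, as a sketch. The formula assumes one K and one V vector per layer, per KV head, per token. The config below (32 layers, 32 KV heads, head dim 128, fp16) is an illustrative assumption, not any specific model.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim,
                   bytes_per_elem=2, batch_size=1):
    """Total KV cache size in bytes; the factor 2 covers K and V."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return batch_size * seq_len * per_token

# Hypothetical config, fp16 (2 bytes per element)
cfg = dict(n_layers=32, n_kv_heads=32, head_dim=128)

size_4k = kv_cache_bytes(4096, **cfg)
size_8k = kv_cache_bytes(8192, **cfg)

print(size_4k / 2**30, "GiB")  # 2.0 GiB
print(size_8k / 2**30, "GiB")  # 4.0 GiB
```

Under these assumptions that is 0.5 MiB per token, and doubling the context doubles the cache.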
Forward pass = 1 decoding step. Entire generation = inference.
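The loop structure, sketched. `toy_forward` is a hypothetical stand-in for a real model's forward pass (it just returns a deterministic token id), and `generate`, `eos_id`, `max_new_tokens` are illustrative names.

```python
def toy_forward(tokens):
    """Hypothetical stand-in for a model forward pass: returns one next-token id."""
    return (sum(tokens) + len(tokens)) % 10

def generate(prompt, max_new_tokens, eos_id=None):
    """Full inference = repeated decoding steps."""
    tokens = list(prompt)
    steps = 0
    for _ in range(max_new_tokens):
        next_token = toy_forward(tokens)  # one forward pass
        tokens.append(next_token)         # one new token
        steps += 1
        if next_token == eos_id:          # optional early stop
            break
    return tokens, steps

tokens, steps = generate([1, 2, 3], max_new_tokens=5)
```

The sequence grows by exactly one token per forward pass, so `len(tokens) == len(prompt) + steps`.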