ML//Transformer//attention//causal masking
Set the strict upper triangle (entries above the diagonal) of the attention score matrix to -∞ before softmax (→ weight 0 after softmax) so tokens can't attend to future positions.
Effect: attention never carries information from later tokens back to earlier ones, which preserves the autoregressive property.
The new token CAN attend to all previous tokens. Previous tokens CANNOT attend to the new one.
Enables KV cache: since previous tokens' representations are "closed" (won't change), we can cache their K and V.
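Minimal sketch of why the cache is valid, assuming one attention layer, one head, and random projection matrices. `cached_step` and `full_causal_attention` are hypothetical helper names. The check at the end compares token-by-token decoding with a cache against a full masked pass.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def full_causal_attention(x, Wq, Wk, Wv):
    """Recompute attention for the whole sequence with a causal mask."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])
    n = x.shape[0]
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    return softmax(np.where(future, -np.inf, scores)) @ v

def cached_step(x_new, cache_k, cache_v, Wq, Wk, Wv):
    """Process one new token, reusing the K and V rows of earlier tokens."""
    q = x_new @ Wq
    cache_k = np.vstack([cache_k, x_new @ Wk])  # append the new key
    cache_v = np.vstack([cache_v, x_new @ Wv])  # append the new value
    scores = cache_k @ q / np.sqrt(q.shape[-1])
    # No mask needed: the cache only ever holds current and past tokens.
    return softmax(scores) @ cache_v, cache_k, cache_v

rng = np.random.default_rng(0)
d = 8
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
x = rng.normal(size=(5, d))  # 5 token embeddings

cache_k, cache_v = np.empty((0, d)), np.empty((0, d))
outputs = []
for token in x:
    out, cache_k, cache_v = cached_step(token, cache_k, cache_v, Wq, Wk, Wv)
    outputs.append(out)

incremental = np.stack(outputs)
full = full_causal_attention(x, Wq, Wk, Wv)
```

`incremental` and `full` agree up to floating-point error: each step only computes one new query, key, and value, and earlier rows of the cache are never touched again.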
Without causal masking (encoder-style): all tokens see all others → better enrichment of early positions, but can't do generation.
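Another reason not to cache enriched embeddings in a bidirectional model: without masking, every new token changes the attention output of all earlier positions, so you'd have to RECOMPUTE EVERYTHING each time a new token arrives.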
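Small check of the recompute claim, under a loud simplification: queries, keys, and values are all the raw embeddings `x` (no learned projections), single head, NumPy. `self_attention` is a hypothetical helper. It compares attention over a 4-token prefix with the first 4 rows of attention over the 5-token sequence.

```python
import numpy as np

def self_attention(x, causal):
    # Simplification for illustration: Q = K = V = x.
    scores = x @ x.T / np.sqrt(x.shape[-1])
    if causal:
        n = x.shape[0]
        future = np.triu(np.ones((n, n), dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    scores = scores - scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    return (w / w.sum(axis=-1, keepdims=True)) @ x

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))
prefix = x[:4]  # the sequence before the 5th token arrives

# Do the first 4 outputs stay the same once the 5th token is appended?
causal_stable = np.allclose(
    self_attention(prefix, causal=True), self_attention(x, causal=True)[:4]
)
bidirectional_stable = np.allclose(
    self_attention(prefix, causal=False), self_attention(x, causal=False)[:4]
)
```

With the mask, the earlier rows are unchanged, so cached results stay valid. Without it, each earlier row now puts some weight on the new token, so its output shifts and anything cached from the prefix is stale.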
BERT has no causal mask → bidirectional understanding. GPT has a causal mask → autoregressive generation. Encoder-decoder cross-attention has no causal mask (decoder sees all encoder tokens freely); the decoder's own self-attention is still causally masked.