ML//Transformer//attention//causal masking

Set the strict upper triangle of the attention score matrix (entries where key position j > query position i) to -∞ before softmax, so those weights become 0 and tokens can't attend to future positions.
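
A minimal sketch of the masking step, in plain Python so it runs without any libraries. The function name `causal_softmax` and the all-zero toy scores are assumptions for illustration; real implementations do this on tensors, usually by adding a mask to the scores.

```python
import math

def causal_softmax(scores):
    """Row-wise softmax over a square score matrix with a causal mask.

    scores[i][j] is the raw (pre-softmax) score of query i against key j.
    Entries with j > i (strict upper triangle) are set to -inf first.
    """
    n = len(scores)
    weights = []
    for i in range(n):
        masked = [scores[i][j] if j <= i else -math.inf for j in range(n)]
        row_max = max(masked)  # always finite: the diagonal is never masked
        exps = [math.exp(s - row_max) for s in masked]  # exp(-inf) == 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Toy input: all raw scores equal, so only the mask shapes the result.
scores = [[0.0] * 4 for _ in range(4)]
weights = causal_softmax(scores)
for row in weights:
    print([round(w, 3) for w in row])
```

With equal scores, row i spreads its weight uniformly over positions 0..i and puts exactly 0 on every later position.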


Effect: attention never carries information from later tokens back to earlier ones — preserves autoregressive property.

The new token CAN attend to all previous tokens. Previous tokens CANNOT attend to the new one.

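Enables KV cache: since previous tokens' representations are "closed" (they depend only on earlier tokens, so they won't change when a new token arrives), we can cache their K and V and compute only the new token's Q, K and V at each step.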
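
A rough sketch of why the cache works, assuming a single attention head and hand-picked toy q/k/v vectors (in a real model these come from learned projections). The helper names `attend` and `full_causal` are made up for this note. Restricting position i to keys 0..i by slicing has the same effect as the -∞ mask.

```python
import math

def attend(q, keys, values):
    """Scaled dot-product attention output for one query."""
    d = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    dim_v = len(values[0])
    return [sum(e * v[c] for e, v in zip(exps, values)) / total
            for c in range(dim_v)]

def full_causal(qs, ks, vs):
    """Recompute every position from scratch; position i sees keys 0..i."""
    return [attend(qs[i], ks[: i + 1], vs[: i + 1]) for i in range(len(qs))]

# Toy per-token (q, k, v) triples, one per position.
tokens = [
    ([1.0, 0.0], [1.0, 0.0], [1.0, 2.0]),
    ([0.0, 1.0], [0.5, 0.5], [3.0, 4.0]),
    ([1.0, 1.0], [0.0, 1.0], [5.0, 6.0]),
]

# Incremental decoding: append the new token's K and V to the cache,
# then run attention for the new token's query only.
k_cache, v_cache, incremental = [], [], []
for q, k, v in tokens:
    k_cache.append(k)
    v_cache.append(v)
    incremental.append(attend(q, k_cache, v_cache))

# Reference: recompute all positions at once with the causal restriction.
qs, ks, vs = (list(x) for x in zip(*tokens))
full = full_causal(qs, ks, vs)
print(incremental == full)
```

Each step does work only for the newest token; earlier outputs are never revisited, yet the result matches the full recomputation.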

Without causal masking (encoder-style): all tokens see all others → better enrichment of early positions, but can't do generation.

Another reason not to save enriched embeddings in that setting: without masking, every earlier token's embedding depends on the new token too, so you'd have to RECOMPUTE EVERYTHING each time a new token arrives.
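
A small check of this claim, as a sketch: append one token and see whether the outputs for the earlier positions stay the same. For brevity it assumes q = k = v = x (no learned projections); the function name `attention` and the toy vectors are illustrative only.

```python
import math

def attention(xs, causal):
    """Self-attention over token vectors xs, with q = k = v = x."""
    out = []
    for i, q in enumerate(xs):
        # Causal: position i sees only 0..i. Otherwise it sees everything.
        visible = xs[: i + 1] if causal else xs
        scores = [sum(a * b for a, b in zip(q, k)) for k in visible]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        out.append([sum(e * v[c] for e, v in zip(exps, visible)) / total
                    for c in range(len(q))])
    return out

prefix = [[1.0, 0.0], [0.0, 1.0]]
extended = prefix + [[1.0, 1.0]]  # one new token arrives

# Do the first two outputs survive the arrival of the third token?
causal_same = attention(extended, causal=True)[:2] == attention(prefix, causal=True)
bidir_same = attention(extended, causal=False)[:2] == attention(prefix, causal=False)
print(causal_same, bidir_same)  # True False
```

With the mask, the old outputs are untouched and could have been cached; without it, they all shift and have to be recomputed.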

BERT has no causal masking → bidirectional understanding. GPT has causal masking → autoregressive generation. Encoder-decoder cross-attention has no causal masking (decoder sees all encoder tokens freely).