ML//model//GPT//causal language modeling
GPT's training objective: predict the next token using only tokens to the left. No [MASK], no bidirectional context.
Opposite of MLM (BERT): MLM predicts hidden tokens from both sides, CLM predicts the future from the past.
The causal mask enforces this during training: each position can attend only to itself and earlier positions, never to later ones.
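A minimal sketch of how a causal mask might be applied to attention scores, in plain Python for readability. The helper names `causal_mask` and `masked_softmax` are hypothetical, and real implementations operate on batched tensors rather than nested lists.

```python
import math

def causal_mask(T):
    """T x T lower-triangular mask: mask[i][j] is True when position i may attend to position j."""
    return [[j <= i for j in range(T)] for i in range(T)]

def masked_softmax(scores, mask):
    """Row-wise softmax in which disallowed positions receive zero weight."""
    out = []
    for row, allowed in zip(scores, mask):
        # Disallowed entries become -inf, so exp(-inf) = 0 after the softmax.
        masked = [s if ok else float("-inf") for s, ok in zip(row, allowed)]
        m = max(masked)  # finite, because the diagonal is always allowed
        exps = [math.exp(s - m) for s in masked]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

# Uniform scores make the effect of the mask easy to see.
mask = causal_mask(4)
weights = masked_softmax([[0.0] * 4 for _ in range(4)], mask)
for row in weights:
    print([round(w, 2) for w in row])
```

With uniform scores, row i spreads its weight evenly over positions 0 through i and gives exactly zero to everything to the right.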
Training produces T-1 examples from a sequence of T tokens: every token after the first is a prediction target, conditioned on everything to its left.
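To make the T-1 count concrete, here is a small sketch that enumerates the (context, target) pairs for one sequence. The function name `next_token_examples` and the toy word-level tokens are assumptions for illustration. In practice, training typically computes all of these predictions in one forward pass by shifting the sequence by one position instead of materializing each pair.

```python
def next_token_examples(tokens):
    """Return (context, target) pairs; each context is every token to the left of its target."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

tokens = ["the", "cat", "sat", "down"]  # T = 4
examples = next_token_examples(tokens)

for context, target in examples:
    print(context, "->", target)

# Equivalent shifted view used for a single forward pass:
inputs = tokens[:-1]   # positions 0 .. T-2
targets = tokens[1:]   # positions 1 .. T-1
```

The first token has no left context, so it serves only as input, which is why a sequence of T tokens gives T-1 targets.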
The same objective from GPT-1 through GPT-4: scale the data, scale the model, same basic task.