ML//model//BERT//masked language modeling
Randomly replace 15% of the input tokens with [MASK], then train the model to predict the original tokens from the context on both sides.
Deceptively simple, but it forces the model to use context from both directions rather than just one.
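The masking step can be sketched in a few lines. This is a minimal illustration, not BERT's actual preprocessing code: `mask_tokens`, `mask_prob`, and `seed` are hypothetical names, tokens are plain strings rather than vocabulary ids, and every selected token becomes [MASK] (the original BERT recipe also leaves some selected tokens unchanged or swaps in random ones, which is omitted here).

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=None):
    """Return (masked_tokens, labels) for one sequence.

    labels[i] holds the original token at each masked position and
    None everywhere else, so there is one target per masked position.
    """
    if not tokens:
        return [], []
    rng = random.Random(seed)
    # Mask roughly mask_prob of the tokens, but always at least one.
    n_to_mask = max(1, round(len(tokens) * mask_prob))
    positions = rng.sample(range(len(tokens)), n_to_mask)

    masked = list(tokens)
    labels = [None] * len(tokens)
    for i in positions:
        labels[i] = tokens[i]   # what the model must predict
        masked[i] = MASK        # what the model actually sees
    return masked, labels

tokens = "the cat sat on the mat and looked out of the window".split()
masked, labels = mask_tokens(tokens, seed=0)
print(masked)
print(labels)
```

Only the positions with a non-None label would contribute to the training loss; the unmasked positions serve purely as context.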
NOT like causal language modeling (GPT): GPT predicts the next token left-to-right, while MLM predicts hidden tokens at any position using full bidirectional context.
The model can't rely on left context alone; it has to infer the masked token from the words on both sides of it.
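To make the contrast concrete, here is a toy sketch of which tokens each objective may look at when predicting one position. `visible_context` is a hypothetical helper for illustration only; real models enforce this through attention masks, not list slicing.

```python
def visible_context(tokens, position, bidirectional):
    """Return (left, right) context usable when predicting tokens[position].

    bidirectional=False mimics a causal LM: only tokens to the left.
    bidirectional=True mimics MLM: tokens on both sides.
    """
    left = tokens[:position]
    right = tokens[position + 1:] if bidirectional else []
    return left, right

tokens = ["she", "sat", "by", "the", "[MASK]", "of", "the", "river"]
target = tokens.index("[MASK]")

causal_left, causal_right = visible_context(tokens, target, bidirectional=False)
mlm_left, mlm_right = visible_context(tokens, target, bidirectional=True)

print("causal:", causal_left, causal_right)
print("mlm:   ", mlm_left, mlm_right)
```

In this example the words after the gap ("of the river") are what point toward "bank", and only the bidirectional setting gets to use them.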
Output: one prediction per masked position. The model uses the full context (past AND future tokens) to disambiguate.