ML//model//BERT

2019-01-20

Bidirectional Encoder Representations from Transformers (Google, 2018)

Bidirectional Encoder Representations from Transformers (Google, 2018)

Encoder-only: sees the full input in both directions (no causal masking), unlike GPT's left-to-right.

Trained with masked language modeling (predict hidden tokens) + NSP (predict if sentences are consecutive, later found unhelpful)

No masking = better enrichment of all positions: early words get fully disambiguated by later context. But can't generate text.

Output: one vector per token. [CLS] token for sentence-level tasks. Downstream layers convert vectors to task outputs.

Dominated NLP benchmarks for 2 years. Fine-tuned with contrastive learning → sentence transformers for RAG