ML//Transformer//encoder
BERT-style architecture: self-attention is fully bidirectional (no masking)
BERT-style architecture: self-attention is fully bidirectional (no masking)
Every token attends to every other token freely — early words get fully disambiguated by later context.
Output: one vector per token. The [CLS] token aggregates global sentence meaning for classification.
Without masking = better enrichment of all positions, but can't generate text (seeing the future defeats autoregressive generation)
Used for understanding tasks: classification, NER, similarity. Downstream layers sit on top to convert vectors into task outputs.