ML//Transformer//encoder

BERT-style architecture: self-attention is fully bidirectional (no masking)


BERT-style architecture: self-attention is fully bidirectional (no masking)

Every token attends to every other token freely — early words get fully disambiguated by later context.

Output: one vector per token. The [CLS] token aggregates global sentence meaning for classification.

Without masking = better enrichment of all positions, but can't generate text (seeing the future defeats autoregressive generation)

Used for understanding tasks: classification, NER, similarity. Downstream layers sit on top to convert vectors into task outputs.