ML//Transformer//tokenizer//embedding matrix
Converts token IDs into dense vectors: token_id 4821 → vector[512 dims]. These are trainable parameters.
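When input and output embeddings are shared, the same matrix, transposed, converts the last layer's output back into a score for each token in the vocabulary (its dot-product "similarity" with that token's embedding). This sharing is called weight tying; it is a common design choice, not a requirement.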
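A minimal sketch of weight tying, assuming NumPy and made-up sizes. The names `E`, `embed`, and `output_logits` are illustrative, not taken from any library. The point is that a single matrix serves both directions.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embedding_dim = 1000, 64  # hypothetical sizes, chosen only for the demo

# One shared parameter matrix: row i is the vector for token ID i.
E = rng.normal(size=(vocab_size, embedding_dim))

def embed(token_ids):
    """Input side: look up one row of E per token ID."""
    return E[token_ids]

def output_logits(hidden):
    """Output side: dot product of each hidden state with every row of E."""
    return hidden @ E.T

token_ids = np.array([42, 7, 999])
hidden = embed(token_ids)        # stand-in for the last layer's output
logits = output_logits(hidden)   # one score per vocabulary entry, per position

print(hidden.shape)  # (3, 64)
print(logits.shape)  # (3, 1000)
```

In a real model the hidden states come out of the transformer layers rather than straight from `embed`; the shortcut here only keeps the shapes visible.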
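Size = vocab_size × embedding_dim. No matter how big the model, the tokenizer and the embedding dimension bottleneck comprehension: every input reaches the layers only as one of vocab_size vectors, each holding embedding_dim numbers.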
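A quick worked example of the size formula. The 512 dimensions come from the note above; the vocabulary size of 50,000 and the 4-byte float32 storage are assumptions picked for illustration.

```python
vocab_size = 50_000      # assumed vocabulary size (illustrative)
embedding_dim = 512      # dimension used in the example above
bytes_per_param = 4      # assuming float32 storage

n_params = vocab_size * embedding_dim
n_bytes = n_params * bytes_per_param

print(n_params)  # 25600000
print(n_bytes)   # 102400000
```

Under these assumptions the matrix alone holds about 25.6 million parameters, roughly 102 MB in float32.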
Not a neural network — it's a lookup table that maps discrete IDs to continuous vectors.
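A small sketch of the lookup-table view, assuming NumPy and toy sizes (all names are illustrative). Indexing rows of the matrix gives the same result as multiplying a one-hot vector by it, which is why the lookup can still be trained like a linear layer.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embedding_dim = 10, 4  # toy sizes for the demo
E = rng.normal(size=(vocab_size, embedding_dim))

token_ids = np.array([3, 7, 3])

# Lookup: pick rows of E by token ID.
looked_up = E[token_ids]

# Equivalent formulation: one-hot rows times E.
one_hot = np.eye(vocab_size)[token_ids]
via_matmul = one_hot @ E

print(np.allclose(looked_up, via_matmul))         # True
print(np.array_equal(looked_up[0], looked_up[2])) # True: same ID, same vector
```

The second check shows that the table has no notion of context: token 3 gets the identical vector at both positions.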
The embeddings it produces are the raw, context-free starting point: every forward pass begins again from these same base vectors, and any context is added only by the layers that follow.