ML//neural network//softmax
Converts a vector of raw scores (logits) into probabilities that sum to 1: exp(xᵢ)/Σexp(xⱼ)
Not an elementwise activation function like ReLU: it's a normalization over the whole vector that creates a probability distribution.
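A minimal sketch of the formula above in plain Python. The function name `softmax` and the example scores are illustrative; subtracting the maximum before exponentiating is a common numerical-stability trick that is not part of the note's formula and does not change the result.

```python
import math

def softmax(logits):
    """Map a list of raw scores to probabilities that sum to 1."""
    # Shifting by the max leaves the output unchanged (the factor cancels
    # in the ratio) but keeps exp() from overflowing on large scores.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Illustrative scores: larger logits get larger probabilities.
probs = softmax([2.0, 1.0, 0.1])
print(probs, sum(probs))
```

Each output depends on every input through the shared denominator, which is why softmax is applied to a whole vector rather than one element at a time.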
Two key uses in transformers:
In attention: applied per row of QKᵀ scores (one row per query token, normalized over the keys) → how much each token attends to every other.
In output: applied to logits over vocabulary → probability of each possible next token.
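The attention use can be sketched as a row-wise softmax over a small score matrix. This assumes Q and K are lists of row vectors (one per token) and that scores are scaled by 1/√d_k as in standard scaled dot-product attention; the scaling and the helper names `softmax` and `attention_weights` are assumptions, not taken from the note.

```python
import math

def softmax(row):
    # Numerically stable softmax over one list of scores.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(Q, K):
    """Return one probability row per query: weights[i][j] is how much
    token i attends to token j."""
    d_k = len(K[0])
    scores = [
        [sum(q * k for q, k in zip(q_i, k_j)) / math.sqrt(d_k) for k_j in K]
        for q_i in Q
    ]
    # Softmax is applied to each row independently, so every row sums to 1.
    return [softmax(row) for row in scores]

# Toy example: two tokens with 2-dimensional queries and keys.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
weights = attention_weights(Q, K)
print(weights)
```

The output use is the same operation applied once to a single vector of vocabulary logits instead of to every row of a matrix.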
Temperature controls sharpness: logits are divided by T before softmax, so T<1 makes the distribution peakier (more confident), T>1 flattens it (more random), and T=1 leaves it unchanged.
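A short sketch of the temperature effect, assuming temperature is applied by dividing the logits by T before the softmax. The function name `softmax_with_temperature` and the example logits are illustrative.

```python
import math

def softmax_with_temperature(logits, T=1.0):
    """Softmax of logits / T. T must be positive."""
    if T <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / T for x in logits]
    m = max(scaled)  # shift for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]
sharp = softmax_with_temperature(logits, T=0.5)  # peakier
base = softmax_with_temperature(logits, T=1.0)   # plain softmax
flat = softmax_with_temperature(logits, T=2.0)   # flatter

# The top probability shrinks as T grows.
print(max(sharp), max(base), max(flat))
```

The ranking of the tokens never changes with T; only how much probability mass the top choices take from the rest.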
The inputs to softmax at the output layer are called logits: raw, pre-normalization scores, one per vocabulary token.