Information Theory//entropy
A measure of average uncertainty in a probability distribution — how many bits you need, on average, to identify an outcome.
Formula: $H(X) = -\sum_i p(x_i) \log_2 p(x_i)$. Maximum when all outcomes are equally likely (uniform distribution), zero when the outcome is certain.
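A minimal sketch of the formula in Python, to make the two extremes concrete. The function name `shannon_entropy` and the three example distributions are illustrative assumptions, not part of the original note.

```python
import math

def shannon_entropy(probs):
    """Entropy in bits of a discrete distribution given as a list of probabilities."""
    # Outcomes with p == 0 are skipped: p * log2(p) tends to 0 as p -> 0.
    return sum(-p * math.log2(p) for p in probs if p > 0)

# Hypothetical four-outcome distributions.
uniform = [0.25, 0.25, 0.25, 0.25]  # all outcomes equally likely
certain = [1.0, 0.0, 0.0, 0.0]      # one outcome is guaranteed
skewed = [0.7, 0.1, 0.1, 0.1]       # one outcome dominates

print(shannon_entropy(uniform))  # 2.0 (the maximum for four outcomes)
print(shannon_entropy(certain))  # 0.0
print(shannon_entropy(skewed))   # somewhere between 0 and 2
```

With four outcomes the uniform case needs exactly 2 bits, which matches the "bits needed to identify an outcome" reading.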
In ML: the output distribution over tokens has an entropy. High entropy = the model is uncertain (many plausible next tokens). Low entropy = the model is confident (one token dominates).
Extended thinking can be viewed as reducing entropy: reasoning tokens add context that narrows the distribution over subsequent tokens, making the correct continuation more probable. This is a distributional shift toward a basin of attraction where correct answers cluster.
The connection to cross-entropy loss: $H(p, q) = H(p) + D_{KL}(p \| q)$. Cross-entropy is the model's average surprise on data drawn from the true distribution $p$. Because $H(p)$ does not depend on the model $q$, minimizing cross-entropy over $q$ is equivalent to minimizing the KL divergence, that is, aligning the model's distribution with the true one.
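The decomposition can be checked numerically. The sketch below assumes two small hand-picked distributions (`p` as the "true" one, `q` as the model's); both are hypothetical, as are the helper names.

```python
import math

def entropy(p):
    """H(p) in bits."""
    return sum(-pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """H(p, q) in bits. Assumes q[i] > 0 wherever p[i] > 0."""
    return sum(-pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """D_KL(p || q) in bits. Assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.25]  # hypothetical true distribution
q = [0.4, 0.4, 0.2]    # hypothetical model distribution

lhs = cross_entropy(p, q)
rhs = entropy(p) + kl_divergence(p, q)
print(lhs, rhs)  # the two values agree up to floating-point error
```

Setting `q = p` drives the KL term to zero, so the cross-entropy bottoms out at `entropy(p)`; the model cannot do better than the data's own entropy.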
In physics: Boltzmann entropy $S = k_B \ln W$ counts microstates. Shannon entropy counts distinguishable messages. Same math, different domains.
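A short worked step makes the "same math" claim explicit. It assumes all $W$ microstates are equally likely ($p_i = 1/W$) and measures Shannon entropy in nats (natural log) rather than bits:

```latex
\begin{aligned}
H &= -\sum_{i=1}^{W} p_i \ln p_i
   = -\sum_{i=1}^{W} \frac{1}{W} \ln \frac{1}{W}
   = \ln W, \\
S &= k_B \ln W = k_B \, H .
\end{aligned}
```

Under that assumption the Boltzmann entropy is the Shannon entropy of the uniform distribution over microstates, scaled by $k_B$.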
Temperature in sampling: high temperature raises entropy (flattens distribution, more randomness), low temperature lowers it (sharpens, more deterministic). Temperature literally controls the entropy of the output distribution.
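A small sketch of the effect, assuming the common convention that logits are divided by the temperature before the softmax. The logits and the temperature values are made up for illustration.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Softmax over logits / temperature (assumes temperature > 0)."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy_bits(probs):
    """Shannon entropy in bits."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

logits = [2.0, 1.0, 0.1]  # hypothetical next-token logits

entropies = {
    t: entropy_bits(softmax_with_temperature(logits, t))
    for t in (0.5, 1.0, 2.0)
}
for t, h in entropies.items():
    print(f"T={t}: entropy = {h:.3f} bits")
```

The printed entropy rises with the temperature and stays below $\log_2 3$, the value for a perfectly flat distribution over three tokens.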