Information Theory//entropy

A measure of average uncertainty in a probability distribution — how many bits you need, on average, to identify an outcome.


Formula: $H(X) = -\sum_i p(x_i) \log_2 p(x_i)$. Maximum when all outcomes are equally likely (uniform distribution), zero when the outcome is certain.
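
A minimal sketch of the formula in Python, using only the standard library. The function name `shannon_entropy` and the three example distributions are illustrative assumptions, not part of the definition.

```python
import math

def shannon_entropy(probs):
    """Return the Shannon entropy, in bits, of a discrete distribution."""
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    # Outcomes with p = 0 are skipped, following the convention 0 * log 0 = 0.
    return sum(-p * math.log2(p) for p in probs if p > 0)

# Illustrative distributions over four outcomes.
uniform = [0.25, 0.25, 0.25, 0.25]  # all outcomes equally likely
skewed = [0.7, 0.1, 0.1, 0.1]       # one outcome dominates
certain = [1.0, 0.0, 0.0, 0.0]      # the outcome is known in advance

for name, dist in [("uniform", uniform), ("skewed", skewed), ("certain", certain)]:
    print(f"{name}: {shannon_entropy(dist):.4f} bits")
```

The uniform case gives exactly 2 bits (the maximum for four outcomes), the certain case gives 0, and the skewed case falls in between.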

In ML: the output distribution over tokens has an entropy. High entropy = the model is uncertain (many plausible next tokens). Low entropy = the model is confident (one token dominates).

Extended thinking can be read as entropy reduction: each reasoning token adds context that tends to narrow the distribution over next tokens, making the correct continuation more probable. In distributional terms, probability mass shifts toward a region where correct answers cluster, a kind of basin of attraction.

The connection to cross-entropy loss: $H(p, q) = H(p) + D_{KL}(p \| q)$, where $p$ is the true distribution and $q$ is the model's. Because $H(p)$ is fixed by the data, minimizing cross-entropy over $q$ is the same as minimizing the KL divergence: reducing the model's average surprise and aligning its distribution with the true one are a single objective.
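
The decomposition can be checked numerically. The sketch below is a toy example: the distributions `p`, `q_poor`, and `q_good` are made-up values, the helper names are assumptions, and everything is measured in bits.

```python
import math

def entropy(p):
    """H(p) in bits."""
    return sum(-pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """H(p, q) in bits; assumes q_i > 0 wherever p_i > 0."""
    return sum(-pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """D_KL(p || q) in bits; assumes q_i > 0 wherever p_i > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical "true" distribution and two candidate model distributions.
p = [0.5, 0.25, 0.25]
q_poor = [0.2, 0.4, 0.4]
q_good = [0.45, 0.3, 0.25]

for name, q in [("q_poor", q_poor), ("q_good", q_good), ("q = p", p)]:
    h_pq = cross_entropy(p, q)
    rhs = entropy(p) + kl_divergence(p, q)
    print(f"{name}: H(p,q)={h_pq:.4f}  H(p)+KL={rhs:.4f}")
```

As `q` moves toward `p`, the KL term shrinks to zero and the cross-entropy approaches its floor, `entropy(p)`.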

In physics: Boltzmann entropy $S = k_B \ln W$ counts microstates. Shannon entropy counts distinguishable messages. Same math, different domains.

Temperature in sampling: high temperature raises entropy (flattens distribution, more randomness), low temperature lowers it (sharpens, more deterministic). Temperature literally controls the entropy of the output distribution.
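
A small sketch of this effect, assuming the common convention of dividing logits by the temperature before the softmax. The logit values and function names are hypothetical, chosen only for illustration.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax over logits / temperature, computed in a numerically stable way."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max so exp() cannot overflow
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy_bits(probs):
    """Shannon entropy in bits."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

# Hypothetical next-token logits for a four-token vocabulary.
logits = [4.0, 2.0, 1.0, 0.5]

for t in (0.25, 1.0, 4.0):
    probs = softmax(logits, t)
    print(f"T={t}: entropy={entropy_bits(probs):.3f} bits, top prob={max(probs):.3f}")
```

Entropy rises with temperature toward the uniform limit of $\log_2 4 = 2$ bits, and falls toward 0 as the temperature approaches 0 and the top token takes nearly all the probability mass.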