Information Theory
Founded by Claude Shannon in "A Mathematical Theory of Communication" (1948) — the paper that launched the digital age.
Founded by Claude Shannon in "A Mathematical Theory of Communication" (1948) — the paper that launched the digital age.
Core question: how much information can be transmitted reliably through a noisy channel?
Key quantities: entropy (uncertainty of a source), mutual information (shared information between variables), channel capacity (maximum reliable transmission rate)
Shannon showed that information is quantifiable, independent of meaning — a bit is a bit whether it encodes poetry or noise.
Deep connection to thermodynamics: Boltzmann entropy measures disorder in physical systems, Shannon entropy measures uncertainty in messages. The formulas are structurally identical.
Connection to ML
In ML: the cross-entropy loss function is a direct application — it measures how surprised the model is by the data, in bits. Pretraining is, at its core, entropy minimization.
Shannon's noisy-channel coding theorem: if you transmit below channel capacity, error-free communication is possible. Above it, errors are inevitable. This has a deep analogy to model capacity — a model too small for the task will always make errors, no matter how long you train it.