ML//Inference//beam search

2026-03-02

A decoding strategy that keeps the top-K most probable partial sequences at each step, instead of committing to a single token. At each decoding step, expand all K candidates by one token, score them, keep the best K.

The middle ground between greedy decoding (K=1, always pick the highest-probability token) and exhaustive search (explore all possibilities, computationally impossible). Beam width K controls the trade-off.

Why greedy fails: the locally best token isn't always globally best. "The cat sat on the..." could greedily pick "mat" but the full best sequence might start differently. Beam search explores multiple paths simultaneously.

Dominated machine translation and seq2seq tasks for years. But for open-ended text generation, beam search produces repetitive, generic text. It optimizes for probability, which favors "safe" continuations.

Largely replaced by sampling-based methods (temperature, top-k, top-p) for LLM generation. Sampling introduces randomness, which produces more creative and diverse text. Beam search is still used for tasks with a "correct" answer (translation, ASR)

Conceptually related to tree search in reasoning models: both explore multiple paths. But beam search scores by token probability, while tree search scores by reasoning validity (using PRMs). Same structure, different evaluation.