ML//Training//dataset//data augmentation

2026-02-25

Creating new training examples by transforming existing ones: flip an image, add noise, paraphrase text, mask tokens. The goal is to expand the effective dataset size without collecting new data.

Augmentation is one of the main defenses against overfitting when you can't get more data. By showing the model different views of the same example, you push it to learn invariances (a cat is still a cat when flipped) instead of memorizing specific pixel patterns.

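In vision: random cropping, rotation, color jitter, cutout (mask random patches). These are standard for CNN training. Convolution's translation equivariance handles translation architecturally (approximately, once pooling is added), but invariance to rotation, scale, and lighting has to be learned from data, and augmentation supplies those examples.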
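
A minimal sketch of three of the vision transforms above (flip, random crop, cutout), written in plain NumPy so it stays self-contained. The function names, the 32x32 image, the crop size, and the patch size are illustrative assumptions, not a reference implementation; libraries such as torchvision ship tuned versions of these transforms.

```python
import numpy as np

def random_flip(img, rng, p=0.5):
    """Horizontally flip an H x W x C image with probability p."""
    return img[:, ::-1, :] if rng.random() < p else img

def random_crop(img, rng, size):
    """Cut a random size x size window out of the image."""
    h, w = img.shape[:2]
    top = rng.integers(0, h - size + 1)   # upper bound is exclusive
    left = rng.integers(0, w - size + 1)
    return img[top:top + size, left:left + size, :]

def cutout(img, rng, patch):
    """Zero out one random patch x patch square, fully inside the image."""
    h, w = img.shape[:2]
    top = rng.integers(0, h - patch + 1)
    left = rng.integers(0, w - patch + 1)
    out = img.copy()                      # never modify the source image
    out[top:top + patch, left:left + patch, :] = 0
    return out

def augment(img, rng, crop_size=28, patch=8):
    """Apply flip, then crop, then cutout. The label is left untouched."""
    img = random_flip(img, rng)
    img = random_crop(img, rng, crop_size)
    return cutout(img, rng, patch)

rng = np.random.default_rng(0)
# Stand-in for a real image; the +0.1 keeps every pixel strictly positive
# so the zeroed cutout patch is easy to spot.
image = rng.random((32, 32, 3)) + 0.1
view = augment(image, rng)
```

Calling `augment` again on the same `image` draws a new flip, crop offset, and patch position, so each epoch sees a different view while the label stays the same.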

In NLP: back-translation (translate to another language and back), synonym replacement, random insertion/deletion. Less natural than vision augmentation because small text changes can alter meaning: deleting "not" from a review flips its sentiment, so the original label no longer holds.
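
A rough sketch of two of the token-level text augmentations listed above, synonym replacement and random deletion. The `SYNONYMS` table, the function names, and the probabilities are hypothetical placeholders; a real pipeline would draw synonyms from a thesaurus or embedding neighbors. Back-translation is left out because it needs a translation model.

```python
import random

# Hypothetical toy synonym table, for illustration only.
SYNONYMS = {"quick": ["fast", "speedy"], "happy": ["glad", "cheerful"]}

def synonym_replace(tokens, rng, p=0.3):
    """Swap each token that has a known synonym with probability p."""
    out = []
    for tok in tokens:
        if tok in SYNONYMS and rng.random() < p:
            out.append(rng.choice(SYNONYMS[tok]))
        else:
            out.append(tok)
    return out

def random_delete(tokens, rng, p=0.1):
    """Drop each token with probability p, always keeping at least one."""
    kept = [tok for tok in tokens if rng.random() >= p]
    return kept if kept else [rng.choice(tokens)]

rng = random.Random(0)
sentence = "the quick dog is not happy".split()
augmented = random_delete(synonym_replace(sentence, rng), rng)
```

Nothing in `random_delete` protects label-critical tokens, so in practice the deletion rate is usually kept low or a list of protected words is added.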

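Masked language modeling (BERT) is arguably a form of augmentation: when tokens are re-masked at random on every pass (dynamic masking, as in RoBERTa; the original BERT fixed its masks during preprocessing), each training pass sees a different "view" of the same sentence. The model must reconstruct the masked tokens from the remaining context.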
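
A simplified sketch of random masking, to make the "different view per pass" idea concrete. `MASK_ID`, `IGNORE`, the token ids, and `mask_tokens` are all hypothetical; the 15% rate follows BERT, but BERT's additional rule (sometimes substituting a random token or keeping the original instead of the mask token) is omitted here.

```python
import numpy as np

MASK_ID = 0     # hypothetical id of the mask token
IGNORE = -100   # sentinel label for positions that are not scored

def mask_tokens(token_ids, rng, mask_prob=0.15):
    """Return (inputs, labels) with a fresh random mask over token_ids."""
    ids = np.asarray(token_ids)
    chosen = rng.random(ids.shape) < mask_prob
    inputs = np.where(chosen, MASK_ID, ids)   # hide the chosen tokens
    labels = np.where(chosen, ids, IGNORE)    # score only the hidden ones
    return inputs, labels

rng = np.random.default_rng(0)
sentence = np.array([101, 7, 42, 13, 99, 5, 23, 102])  # made-up token ids

# A new mask is drawn on every epoch, so the same sentence
# yields a different training example each time.
for epoch in range(3):
    inputs, labels = mask_tokens(sentence, rng)
```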

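Distinct from synthetic data: augmentation transforms real examples (preserving ground truth), while synthetic data generates entirely new examples (from a model or procedural rules). Augmentation is generally safer because the transformations are chosen to be label-preserving, though a badly chosen one can still break the label (flipping a "6" vertically yields something closer to a "9"). Synthetic data inherits the systematic biases of its generator and, when models are trained repeatedly on generated output, can lead to model collapse.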