ML//Training//dataset//synthetic data

2024-01-20

Teaching AIs with other AIs' homework: it works surprisingly well until the errors start compounding and nobody remembers what the original looked like.

Teaching AIs with other AIs' homework: it works surprisingly well until the errors start compounding and nobody remembers what the original looked like.

Risk of model collapse: errors compound across generations, diversity shrinks. Tail distribution erodes first.

But carefully curated synthetic data (math proofs, code with verification) works well: the key is external verification, not blind self-training.

The uncomfortable question: what happens when most training data is AI-generated?

Constitutional AI uses synthetic data wisely: AI generates reflection-improved responses, but the constitution provides an external anchor.