ML//Training//SFT

2026-02-15

You give the AI a script: input X -> output A. A collection of such pairs is what we call an instruction-tuning dataset.

This creates a rigid ceiling: A acts as a massive gravity attractor (singular attractor learning: cross-entropy loss + SGD push all probability mass toward the one reference output), so the AI stops exploring.

Valid, beautiful synonyms (Bx, By, Bz) are sacrificed at the altar of literal imitation.

It becomes a "helpful companion", obedient but uncreative.
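The attractor effect can be sketched with a toy example. This is a minimal illustration, not a real training loop: the "model" is just four logits, one per candidate output (A plus the synonyms Bx, By, Bz), and the learning rate and step count are arbitrary assumptions. Each step is plain gradient descent on cross-entropy with a one-hot target on A.

```python
import math

def softmax(logits):
    """Convert logits to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sft_step(logits, target_idx, lr=0.5):
    """One gradient-descent step on cross-entropy with a one-hot target."""
    probs = softmax(logits)
    # d(CE)/d(logit_i) = p_i - 1[i == target]
    return [z - lr * (p - (1.0 if i == target_idx else 0.0))
            for i, (z, p) in enumerate(zip(logits, probs))]

labels = ["A", "Bx", "By", "Bz"]
logits = [0.0, 0.0, 0.0, 0.0]  # all four outputs start equally likely

for _ in range(200):  # step count is an arbitrary choice for the toy
    logits = sft_step(logits, target_idx=0)  # the script only ever says "A"

final_probs = softmax(logits)
for name, p in zip(labels, final_probs):
    print(f"{name}: {p:.3f}")
```

Nothing in the loss says the synonyms are wrong. They lose probability only because every gradient step rewards A and nothing else.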

Exposure bias: during SFT the model always sees perfect human-written context (teacher forcing), but at inference it sees its own outputs. Errors compound because the model was never trained with its own mistakes as context.
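A rough way to see the compounding is to count expected errors under a deliberately simple model. Everything here is an assumption for illustration: the per-token error rates (`eps_clean`, `eps_dirty`) are made-up numbers, and the rule that one mistake contaminates the context for the rest of the sequence is a simplification.

```python
def expected_errors(steps, eps_clean, eps_dirty, teacher_forced):
    """Expected number of wrong tokens over a sequence.

    eps_clean: error rate when the context so far is clean (hypothetical value)
    eps_dirty: error rate once the context contains a mistake (hypothetical value)
    teacher_forced: if True, the context is reset to the reference every step
    """
    p_clean = 1.0  # probability that the context is still mistake-free
    total = 0.0
    for _ in range(steps):
        total += p_clean * eps_clean + (1.0 - p_clean) * eps_dirty
        if not teacher_forced:
            # Assumption: a mistake stays in the context for all later steps.
            p_clean *= (1.0 - eps_clean)
    return total

train_like = expected_errors(200, eps_clean=0.01, eps_dirty=0.5, teacher_forced=True)
inference_like = expected_errors(200, eps_clean=0.01, eps_dirty=0.5, teacher_forced=False)
print(f"teacher-forced: {train_like:.1f} expected errors")
print(f"free-running:   {inference_like:.1f} expected errors")
```

Under teacher forcing the error count grows linearly with length. In the free-running case it grows much faster, because each mistake raises the chance of the next one.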

Iterative SFT

Iterative SFT (rejection sampling / RFT): generate many answers, pick the best, pretend it was ground truth, train on it, repeat.

Downside of iterative SFT: the echo chamber amplifies biases and slop until the AI loses edge and variety. Also "average in, average out": if the best of 100 samples is a 7/10, you're training on mediocrity.
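The generate / pick best / train / repeat loop, and the loss of variety that comes with it, can be sketched on a toy policy. This is a sketch under stated assumptions, not a real RFT pipeline: the policy is a softmax over five fixed candidate answers, `scores` is a hypothetical quality rating out of 10 (capped at 7 on purpose), and the sample count, learning rate, and seed are arbitrary.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; used here as a crude measure of variety."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def rft_round(logits, scores, rng, n_samples=8, lr=0.5):
    probs = softmax(logits)
    # 1. Generate many answers from the current policy.
    samples = rng.choices(range(len(logits)), weights=probs, k=n_samples)
    # 2. Pick the best one according to the (hypothetical) scorer.
    best = max(samples, key=lambda i: scores[i])
    # 3. Pretend it was ground truth: one cross-entropy step toward it.
    return [z - lr * (p - (1.0 if i == best else 0.0))
            for i, (z, p) in enumerate(zip(logits, probs))]

scores = [3, 4, 5, 6, 7]      # hypothetical ratings; nothing better than 7 exists
logits = [0.0] * len(scores)  # start with maximum variety
rng = random.Random(0)        # fixed seed so the run is repeatable

entropy_before = entropy(softmax(logits))
for _ in range(100):
    logits = rft_round(logits, scores, rng)
final_probs = softmax(logits)
entropy_after = entropy(final_probs)

print(f"entropy before: {entropy_before:.3f}, after: {entropy_after:.3f}")
print("final policy:", [round(p, 3) for p in final_probs])
```

Two things are visible in the toy. Entropy drops as the policy keeps training on its own picks, and the loop can only ever reinforce answers the policy already produces, so a pool capped at 7/10 cannot yield anything better.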

Synthetic SFT is the umbrella term: RFT, Constitutional AI outputs used for SFT, distillation (big model writes, small model learns), self-instruct (the AI generates the questions too).