ML//Training

2026-02-15

The pipeline that takes a raw model from "predicts text" to "follows instructions" to "aligns with human values".

Rough evolution: pre-training → SFT → RLHF/DPO

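Data format is largely algorithm-agnostic: the same (prompt, chosen, rejected) triplets can feed DPO, PPO, and GRPO. The difference is on-policy vs off-policy consumption. DPO trains on the triplets directly (off-policy); PPO and GRPO generate fresh samples from the current policy (on-policy), so the triplets usually reach them indirectly, e.g. through a reward model trained on them.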
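A minimal sketch of how one triplet can be consumed two ways: directly by a DPO-style loss, or by a pairwise (Bradley-Terry) reward-model loss whose output could later score on-policy samples. The names `PreferenceTriplet`, `dpo_loss`, and `reward_model_loss`, the `beta` default, and the plain-float log-probabilities are assumptions for illustration; a real setup would compute these from model outputs in a tensor library.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class PreferenceTriplet:
    """One preference example (hypothetical container, not a library type)."""
    prompt: str
    chosen: str
    rejected: str


def _neg_log_sigmoid(x: float) -> float:
    # -log(sigmoid(x)) == softplus(-x), written in a numerically stable form.
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Off-policy use: DPO loss for one triplet from sequence log-probs.

    The log-probs are those of the fixed chosen/rejected responses under
    the policy being trained and under a frozen reference model.
    """
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    return _neg_log_sigmoid(beta * margin)


def reward_model_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Indirect use: pairwise loss for a reward model trained on the same
    triplet; the trained model can then score fresh on-policy samples."""
    return _neg_log_sigmoid(reward_chosen - reward_rejected)


triplet = PreferenceTriplet(
    prompt="Explain overfitting in one sentence.",
    chosen="The model memorizes training data and generalizes poorly.",
    rejected="Overfitting is when the GPU runs too hot.",
)
# Log-probs and rewards below are made-up numbers, not model outputs.
loss_dpo = dpo_loss(-12.0, -15.0, -13.0, -14.0)
loss_rm = reward_model_loss(1.5, -0.5)
```

Both losses shrink as the chosen response is ranked further above the rejected one; only where the scores come from differs.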

Catastrophic forgetting is real: training only on new data can overwrite previously learned knowledge. A common mitigation is a replay buffer, i.e. mixing a percentage of old data into the new training set.
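The replay idea can be sketched as a simple dataset mixer. The function name `mix_with_replay`, the 10% default, and the fixed seed are assumptions for illustration; the right fraction depends on the task and is not specified here.

```python
import random


def mix_with_replay(new_data, old_data, replay_fraction=0.1, seed=0):
    """Return a shuffled training list in which roughly `replay_fraction`
    of the examples are sampled from `old_data` (the replay buffer)."""
    if not 0.0 <= replay_fraction < 1.0:
        raise ValueError("replay_fraction must be in [0, 1)")
    rng = random.Random(seed)
    # Solve n_old / (n_new + n_old) = f for n_old.
    n_old = round(replay_fraction * len(new_data) / (1.0 - replay_fraction))
    # Cannot replay more old examples than the buffer holds.
    n_old = min(n_old, len(old_data))
    replay = rng.sample(list(old_data), n_old)
    mixed = list(new_data) + replay
    rng.shuffle(mixed)
    return mixed


new_examples = [("new", i) for i in range(90)]
old_examples = [("old", i) for i in range(100)]
train_set = mix_with_replay(new_examples, old_examples, replay_fraction=0.1)
```

Sampling without replacement keeps replayed examples distinct within one mix; a new seed per epoch would rotate which old examples are revisited.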

Taxonomic annotation (labeling) is a data enrichment step that can feed into any method: SFT training pairs, constitution enhancement for DPO, pre-training data, or RAG.
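As a rough illustration of one annotated record feeding two of these consumers, the sketch below turns a labeled text into an SFT pair and into a RAG document with label metadata. Everything here is hypothetical: the `AnnotatedRecord` type, the field names, the prompt wording, and the example taxonomy path are assumptions, not a schema from any particular tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AnnotatedRecord:
    """A text plus its taxonomy path, broadest label first (hypothetical)."""
    text: str
    labels: tuple


def to_sft_pair(record: AnnotatedRecord) -> dict:
    # SFT use: teach the model to produce the label path for a text.
    return {
        "prompt": f"Classify the following text:\n{record.text}",
        "completion": " > ".join(record.labels),
    }


def to_rag_document(record: AnnotatedRecord) -> dict:
    # RAG use: keep the text as content, expose labels as filterable metadata.
    return {
        "content": record.text,
        "metadata": {"labels": list(record.labels)},
    }


record = AnnotatedRecord(
    text="Replay buffers mix old data into new training runs.",
    labels=("ML", "Training", "Forgetting"),
)
sft_pair = to_sft_pair(record)
rag_doc = to_rag_document(record)
```

The annotation is done once and stored with the record, so each downstream method only needs its own small adapter.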