ML//Training

The pipeline that takes a raw model from "predicts text" to "follows instructions" to "aligns with human values".


Rough evolution: pre-training → SFT → RLHF/DPO

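Data format is largely algorithm-agnostic: the same (prompt, chosen, rejected) triplets can feed DPO, PPO, and GRPO. The difference is how they are consumed. DPO trains directly on the stored pairs (off-policy). PPO and GRPO typically use the pairs to fit a reward model, then optimize on completions freshly sampled from the current policy (on-policy).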
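A minimal sketch of the two consumption modes, in plain Python. The names `PreferenceTriplet`, `dpo_loss`, and `prompts_for_rollouts` are illustrative assumptions, not part of any library; the loss is the standard per-example DPO objective, written on precomputed sequence log-probabilities rather than a real model.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class PreferenceTriplet:
    """One preference record: hypothetical container, not a library type."""
    prompt: str
    chosen: str
    rejected: str


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Off-policy use: score the stored chosen/rejected pair directly.

    Inputs are summed log-probabilities of each completion under the
    policy and under a frozen reference model.
    """
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # -log(sigmoid(margin)), written in a numerically stable form
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def prompts_for_rollouts(triplets):
    """On-policy use: keep only the prompts (deduplicated, order preserved).

    PPO/GRPO-style training would sample fresh completions for these
    prompts and score them with a reward model fitted on the pairs.
    """
    return list(dict.fromkeys(t.prompt for t in triplets))
```

When the policy and the reference assign identical log-probabilities, the margin is zero and the loss equals log 2; it falls as the policy favors the chosen completion more strongly than the reference does.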

Catastrophic forgetting is real: training only on new data overwrites old knowledge. Mitigation: replay buffers, i.e. mixing a percentage of old data into each new training run.
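A minimal sketch of replay mixing, assuming both datasets fit in memory as Python lists. The function name `mix_with_replay` and the default `replay_fraction=0.1` are illustrative assumptions; the right fraction depends on the task and is not specified here.

```python
import random


def mix_with_replay(new_data, old_data, replay_fraction=0.1, seed=0):
    """Return a shuffled training set where roughly `replay_fraction`
    of the examples are sampled (without replacement) from `old_data`."""
    if not 0.0 <= replay_fraction < 1.0:
        raise ValueError("replay_fraction must be in [0, 1)")
    rng = random.Random(seed)
    # Solve n_old / (n_new + n_old) = f for n_old.
    n_old = round(replay_fraction * len(new_data) / (1.0 - replay_fraction))
    # Cannot replay more old examples than exist.
    n_old = min(n_old, len(old_data))
    replay = rng.sample(list(old_data), n_old)
    mixed = list(new_data) + replay
    rng.shuffle(mixed)
    return mixed
```

With 90 new examples and `replay_fraction=0.1`, this draws 10 old examples, so old data makes up 10% of the 100-example mix.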

Taxonomic annotation (labeling) is a data enrichment step that can feed into any method: SFT training pairs, constitution enhancement for DPO, pre-training data, or RAG.
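A rough sketch of how one annotated record could be routed to two of those consumers. Everything here is hypothetical: the two-label `TAXONOMY`, the keyword rules standing in for a real labeler, and the output shapes of `to_sft_pair` and `to_rag_record`.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical taxonomy: label -> keywords that trigger it.
TAXONOMY = {
    "optimization": ("learning rate", "gradient", "optimizer"),
    "data": ("dataset", "tokenizer", "deduplication"),
}


@dataclass
class AnnotatedDoc:
    text: str
    labels: List[str] = field(default_factory=list)


def annotate(text):
    """Attach every taxonomy label whose keywords appear in the text."""
    lowered = text.lower()
    labels = [label for label, keywords in TAXONOMY.items()
              if any(k in lowered for k in keywords)]
    return AnnotatedDoc(text=text, labels=labels)


def to_sft_pair(doc):
    """Turn the annotation into a (prompt, completion) training pair."""
    return {
        "prompt": f"Which topics does this passage cover?\n\n{doc.text}",
        "completion": ", ".join(doc.labels) or "none",
    }


def to_rag_record(doc):
    """Keep the labels as metadata that a retriever could filter on."""
    return {"text": doc.text, "metadata": {"labels": list(doc.labels)}}
```

The point of the design is that annotation happens once, and each downstream method only needs a thin adapter over the same labeled record.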