ML//Red teaming
Stress tests on LLM behavior — hire people (or AIs) to break your model on purpose.
Stress tests on LLM behavior — hire people (or AIs) to break your model on purpose.
Findings become fuel for training: SFT (quick patch), Constitutional AI → DPO (deep fix — now you can generate more pairs for more extensive DPO), system prompt guardrails (bandaid)
An arms race with no finish line — every patch creates new edges, every new capability creates new attack surfaces.
Evaluation taxonomy
Benchmarks: exams on LLM intelligence, usually scored.
Safety training: like benchmarks but for the 500 known danger categories instead of intelligence.
Guardrails: pure system prompt injection, no training — faster than SFT/DPO but most brittle.
Few-shot learning: no training at all, weights stay frozen — shove examples into the prompt to narrow down the latent space