ML//Red teaming

Stress tests on LLM behavior — hire people (or AIs) to break your model on purpose.


Stress tests on LLM behavior — hire people (or AIs) to break your model on purpose.

Findings become fuel for training: SFT (quick patch), Constitutional AI → DPO (deep fix — now you can generate more pairs for more extensive DPO), system prompt guardrails (bandaid)

An arms race with no finish line — every patch creates new edges, every new capability creates new attack surfaces.

Evaluation taxonomy

Benchmarks: exams on LLM intelligence, usually scored.

Safety training: like benchmarks but for the 500 known danger categories instead of intelligence.

Guardrails: pure system prompt injection, no training — faster than SFT/DPO but most brittle.

Few-shot learning: no training at all, weights stay frozen — shove examples into the prompt to narrow down the latent space