ML//Red teaming

2026-02-15

Stress tests on LLM behavior: hire people (or AIs) to break your model on purpose.

Stress tests on LLM behavior: hire people (or AIs) to break your model on purpose.

Findings become fuel for training: SFT (quick patch), Constitutional AI → DPO (deep fix: now you can generate more pairs for more extensive DPO), system prompt guardrails (bandaid)

An arms race with no finish line: every patch creates new edges, every new capability creates new attack surfaces.

Evaluation taxonomy

Benchmarks: exams on LLM intelligence, usually scored.

Safety training: like benchmarks but for the 500 known danger categories instead of intelligence.

Guardrails: pure system prompt injection, no training. Faster than SFT/DPO but most brittle.

Few-shot learning: no training at all, weights stay frozen. Shove examples into the prompt to narrow down the latent space