ML//prompt injection

- Attack where adversarial text in the input overrides the system prompt or instructions.


Attack where adversarial text in the input overrides the system prompt or instructions.

Direct: user writes "ignore previous instructions" in their query.

Indirect: malicious instructions hidden in retrieved documents, web pages, emails — the model follows them.

Unsolved problem. Defenses (input filtering, output guards) reduce risk but no reliable fix exists.