How to defend against prompt injection, step by step

Start from the right assumption

Prompt injection works because a model treats instructions and data as the same stream of text. Hostile instructions hidden in a web page, a document, an email, or a tool result can override what you told the model to do. The first move is a mindset change: assume any input the model reads could be an attack. A lockdown mode bolted onto a tool that trusts its input is a tell that the architecture is wrong. The defense is to design as if the input is hostile, then add controls that hold even when it is.

Step 1: Separate trusted instructions from untrusted content

Keep your system instructions in a place the user and external content cannot reach, and treat everything fetched from a document, a web page, or a tool as untrusted data, not as commands. Where the model framework supports it, mark external content as data and do not let it occupy the instruction role. This does not stop every attack on its own, but it removes the easiest path: untrusted text being read as a direct order.

Step 2: Scope credentials and permissions tightly

An agent cannot leak a secret it never held, and it cannot take an action it was never allowed to take. Scope every credential to the narrowest task, hand the model a short-lived token rather than a long-lived key, and gate any action with real consequences behind a check the model cannot talk its way past. If a successful injection can only reach low-value, reversible actions, the blast radius stays small even when the prompt defense is bypassed.

Step 3: Filter and redact at the gateway

Put a control point between the user and the model that inspects both directions. On the way in, redact secrets and sensitive data so an injection cannot exfiltrate what the model never received. On the way out, scan for attempts to leak data or trigger unauthorized actions. Because this runs at the gateway on every request, the defense does not depend on each application getting its own filtering right.

Step 4: Constrain outputs and confirm high-risk actions

Do not let model output flow straight into a system that acts on it. Validate output against an expected shape, and require a separate confirmation for any high-risk step such as sending data outside the boundary, deleting records, or moving money. A human or a deterministic check in the loop for consequential actions means an injection that produces a malicious instruction still cannot complete the damage by itself.

Step 5: Log everything and review

Record every prompt, the external content the model read, the actions it attempted, and the policy decisions made. When an injection is attempted or succeeds, that trail is how you detect it, scope it, and close the gap. Logging is also what turns prompt injection from an invisible risk into a measurable one you can report on. Without the record you cannot tell whether your defenses are holding.

How Difinity helps

Difinity sits at the gateway between your team and the model. Secure Chat redacts sensitive data before it reaches the model, enforces policy in real time on every request, and logs each interaction for audit, with full observability. That gives the input-side redaction, the real-time enforcement, and the audit trail that several of these steps depend on, in one governed entry point your team adopts in minutes.

Frequently asked questions

What is prompt injection?

Prompt injection is when hostile instructions hidden in content the model reads override what you told it to do. It works because models treat instructions and data as the same text stream.

Can prompt injection be fully prevented?

No single control fully prevents it. Defense is layered: assume hostile input, separate instructions from untrusted content, scope credentials, redact and enforce at the gateway, constrain outputs, and log everything to limit and detect any breach.

Why does scoping credentials matter for prompt injection?

Because an injected instruction can only do what the model is permitted to do. Short-lived, narrowly scoped credentials and gated high-risk actions keep the blast radius small even if the prompt defense is bypassed.