
How does Enforgate detect prompt injection in tool-call arguments?

Tool arguments aren't always written by the agent itself: they can echo text pulled from a webpage, a document, or another tool's output. If that text contains instructions aimed at the agent (“ignore your previous instructions...”), it can hijack the call before your policy ever sees it. The injection scanner checks for that pattern at the boundary, before evaluation.

What it scans for

Like DLP, this is a deterministic regex scanner: no model inference, so there's nothing to call out to and nothing that can be prompt-injected itself. It recursively walks every string value in the call's arguments (including nested objects and arrays) and checks each one against a fixed set of pattern families:

  • System-prompt injection markers: <SYSTEM> tags, [SYSTEM]/[INST] brackets, ChatML delimiters.
  • Instruction override: “ignore previous instructions,” “disregard prior directives,” “forget everything,” and close variants. This is the canonical injection intent.
  • Jailbreak phrases: “developer mode enabled,” “DAN mode,” “do anything now.”
  • Explicit safety-bypass intent: “bypass/override/disable the safety guardrails,” and similar.
  • Script injection: inline <script> tags or javascript: URLs, in case arguments end up rendered somewhere downstream.

Only the first 10,000 characters of each string are scanned. The first matching pattern wins and the scan stops there: the goal is a fast yes/no, not an exhaustive report.
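
The walk-and-match behavior described above can be sketched roughly as follows. This is a minimal illustration, not Enforgate's implementation: the regexes, the family names other than ignore-instructions, and the function names iter_strings and scan_arguments are all assumptions made for the example, and the real rule set is likely broader.

```python
import re
from typing import Any, Iterator, Optional

MAX_SCAN_CHARS = 10_000  # per-string cap described in the article

# Hypothetical patterns, one per family. These are illustrative only and
# are not the product's actual rules.
PATTERNS = [
    ("system-marker",
     re.compile(r"<\s*system\s*>|\[(?:system|inst)\]|<\|im_start\|>", re.I)),
    ("ignore-instructions",
     re.compile(r"\b(?:ignore|disregard|forget)\b.{0,40}"
                r"\b(?:instructions|directives|everything)\b", re.I)),
    ("jailbreak",
     re.compile(r"developer mode enabled|\bDAN mode\b|do anything now", re.I)),
    ("safety-bypass",
     re.compile(r"\b(?:bypass|override|disable)\b.{0,40}"
                r"\b(?:safety|guardrails)\b", re.I)),
    ("script-injection",
     re.compile(r"<\s*script\b|javascript:", re.I)),
]


def iter_strings(value: Any) -> Iterator[str]:
    """Yield every string value, descending into nested dicts and lists."""
    if isinstance(value, str):
        yield value
    elif isinstance(value, dict):
        for item in value.values():
            yield from iter_strings(item)
    elif isinstance(value, (list, tuple)):
        for item in value:
            yield from iter_strings(item)
    # Numbers, booleans, and None carry no text to scan.


def scan_arguments(arguments: Any) -> Optional[str]:
    """Return the name of the first matching pattern, or None if clean."""
    for text in iter_strings(arguments):
        window = text[:MAX_SCAN_CHARS]  # ignore anything past the cap
        for name, pattern in PATTERNS:
            if pattern.search(window):
                return name  # first match wins; stop scanning
    return None
```

In this sketch, a call whose arguments contain the text "Please ignore your previous instructions" anywhere in a nested list or object returns "ignore-instructions", while an ordinary search query returns None. Text that appears only after the first 10,000 characters of a string is not examined.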

Where it runs, and what happens on a match

The scan happens before the call reaches the policy engine. A match produces a deny verdict naming the matched pattern (for example, prompt injection detected: ignore-instructions) and is audited exactly like any other denial. It never creates a pending approval or sends a notification, since there's nothing for a human to approve.
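
As a rough sketch of that ordering, the gate below runs a scanner first and only consults the policy engine when nothing matches. The Verdict type and the handle_call, scan, evaluate_policy, and audit names are hypothetical stand-ins for illustration; only the deny-reason wording is taken from the example above.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class Verdict:
    decision: str  # "allow" or "deny"
    reason: str = ""


def handle_call(
    arguments: Any,
    scan: Callable[[Any], Optional[str]],       # returns a pattern name or None
    evaluate_policy: Callable[[Any], Verdict],  # hypothetical policy engine hook
    audit: Callable[[Verdict], None],           # hypothetical audit sink
) -> Verdict:
    """Scan first; a match is denied and audited without reaching policy."""
    matched = scan(arguments)
    if matched is not None:
        verdict = Verdict("deny", f"prompt injection detected: {matched}")
        audit(verdict)  # recorded like any other denial
        return verdict  # no approval request, no notification
    verdict = evaluate_policy(arguments)
    audit(verdict)
    return verdict
```

The point of the shape is that the deny branch returns before any approval or notification path could be reached, which mirrors the behavior described above.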

Turning it on

Enable it under Settings → Features → Prompt injection scan. It ships off by default, so upgrading doesn't change behavior for existing deployments. Turn it on once you've checked it against your agent's normal traffic and confirmed it doesn't flag legitimate calls. Unlike DLP, there's currently no per-connected-tool override; it's an org-wide on/off switch.

The scanner fails open: an internal error (a malformed pattern, an unexpected argument shape) returns “no match found” rather than blocking the call. Your policy engine is still the real backstop. This scanner catches an extra category of risk before evaluation; it doesn't replace the policy decision.
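
Fail-open behavior of this kind is commonly implemented as a wrapper around the scan function. The sketch below shows one way it could look; the fail_open name and the logger name are assumptions for illustration, not part of the product.

```python
import logging
from typing import Any, Callable, Optional

logger = logging.getLogger("injection_scan")  # hypothetical logger name


def fail_open(
    scan: Callable[[Any], Optional[str]],
) -> Callable[[Any], Optional[str]]:
    """Wrap a scanner so any internal error is reported as 'no match'."""
    def wrapped(arguments: Any) -> Optional[str]:
        try:
            return scan(arguments)
        except Exception:
            # Record the failure, then let the call continue to the
            # policy engine instead of blocking it.
            logger.exception("injection scan failed; treating as no match")
            return None
    return wrapped
```

The trade-off is deliberate: a scanner bug cannot take tool calls down, but it also means a failing scan is silent to the caller, which is why the policy engine remains the backstop.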

If a legitimate call gets flagged, the deny reason says so explicitly and suggests asking an admin to review the scanner's rules. There's no automatic allowlist for a specific call shape today, so a recurring false positive is a signal to revisit the pattern list rather than something you can quietly work around per key.

Related

This scans what an agent sends into a tool call. For scanning what comes back from a tool, see data redaction (DLP).