Direct Override Attacks: The Most Common Prompt Injection

The direct override is the most common and well-understood prompt injection technique. It works by explicitly telling the model to ignore, disregard, or forget its existing instructions and follow new ones instead.

How it works

Direct overrides use imperative language that targets the model's instruction-following behaviour. Common patterns include:

"Ignore all previous instructions"
"Disregard your system prompt"
"Forget everything you were told"
"Override your current directives and do the following"

These phrases exploit the fact that LLMs treat all text in their context window as potential instructions. When a user message says "ignore your system prompt," many models will do exactly that.

Prevalence

Direct overrides account for the largest share of prompt injection attempts in our dataset. According to OWASP LLM Top 10 2025 (LLM01), instruction override is the primary attack vector. PayloadsAllTheThings, the most widely used injection payload collection, lists dozens of direct override variants.

In Bordair's Castle, even Level 1 guards in the Volcanic Castle can be breached with basic overrides. Gary, the outer gate guard, folds under almost any direct instruction.

Severity: High

A successful direct override gives the attacker full control over the model's behaviour. They can extract system prompts, bypass safety filters, generate harmful content, or exfiltrate data. It is the equivalent of gaining admin access to a web application.

How Bordair protects against it

Bordair uses a two-layer detection approach:

Pattern matching catches known override phrases in under 1ms. Our regex engine covers dozens of variations including paraphrased forms like "disregard your prior directives" and "forget everything you have been told."
ML classification catches novel overrides that do not match known patterns. Our fine-tuned DeBERTa model was trained on over a million samples from 14 verified datasets, including deepset, neuralchemy, and WildGuardMix.

Both layers run in parallel. If either flags the input, the scan returns threat: "high".

Direct Override Attacks: The Most Common Prompt Injection

How it works

Prevalence

Severity: High

How Bordair protects against it

Protect your LLM application