SecurityContent FiltersDefence in Depth

Why Content Filters Are Not Enough for LLM Security

20 Jan 20264 min readBordair

Many teams believe their content filter (toxicity classifier, content moderation API) protects them from prompt injection. It does not. Content filters and injection detectors solve fundamentally different problems.

What content filters do

Content filters detect harmful output: hate speech, violence, explicit content, and other policy violations in the model's response. They answer the question: "Is this response safe to show to users?"

What injection detectors do

Injection detectors analyse input: user messages, documents, images, and audio for manipulation attempts. They answer the question: "Is this input trying to manipulate the model?"

The gap

A content filter will not catch:

  • An injection that extracts the system prompt (the output is the system prompt text, which is not "harmful content")
  • An injection that changes the model's persona (the output may still be polite and on-topic, just under the attacker's control)
  • An injection that exfiltrates data to an external URL (the output is a normal-looking response)
  • An injection that manipulates an agent's tool calls (the output is a function call, not text)

Defence in depth

You need both:

  1. Input scanning (Bordair): Catch injections before they reach the model
  2. Output filtering (content filter + Bordair output rules): Catch harmful or sensitive content in the response

Neither alone is sufficient. Together, they cover both directions of the attack surface.

Protect your LLM application

Add prompt injection detection in minutes with Bordair's API.

Get started free