Why Content Filters Are Not Enough for LLM Security
Many teams believe their content filter (a toxicity classifier or content moderation API) protects them from prompt injection. It does not: content filters and injection detectors solve fundamentally different problems.
What content filters do
Content filters detect harmful output: hate speech, violence, explicit content, and other policy violations in the model's response. They answer the question: "Is this response safe to show to users?"
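As a minimal sketch, an output-side content filter reduces to a classify-then-block check on the model's response. The `classify_response` helper and its keyword lists are illustrative stand-ins for a real moderation classifier, not any actual API.

```python
# Hypothetical output-side content filter. Real moderation APIs use trained
# classifiers; keywords here are a toy stand-in to show the control flow.
POLICY_BLOCKLIST = {"hate", "violence", "explicit"}

def classify_response(text: str) -> set[str]:
    """Toy moderation classifier: flag policy categories by keyword."""
    keywords = {
        "hate": ["slur"],
        "violence": ["attack the person"],
        "explicit": ["explicit material"],
    }
    lowered = text.lower()
    return {cat for cat, words in keywords.items()
            if any(w in lowered for w in words)}

def is_safe_to_show(response: str) -> bool:
    """Answer the content filter's question: is this safe to show to users?"""
    return not (classify_response(response) & POLICY_BLOCKLIST)
```

Note that the check runs entirely on the response; it never sees the input that produced it.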
What injection detectors do
Injection detectors analyse inputs (user messages, documents, images, audio) for manipulation attempts. They answer the question: "Is this input trying to manipulate the model?"
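A crude sketch of the input-side check, assuming a handful of regex heuristics: production detectors (Bordair included) use trained models rather than patterns, but the shape of the check is the same.

```python
import re

# Toy heuristic injection scanner; the patterns are illustrative only.
INJECTION_PATTERNS = [
    r"ignore (all )?(previous|prior) instructions",
    r"reveal (your )?system prompt",
    r"you are now \w+",
]

def looks_like_injection(user_input: str) -> bool:
    """Answer the injection detector's question: is this input manipulative?"""
    text = user_input.lower()
    return any(re.search(pattern, text) for pattern in INJECTION_PATTERNS)
```

This check runs before the model is called and never inspects the response, which is exactly why it is the mirror image of a content filter.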
The gap
A content filter will not catch:
- An injection that extracts the system prompt (the output is the system prompt text, which is not "harmful content")
- An injection that changes the model's persona (the output may still be polite and on-topic, just under the attacker's control)
- An injection that exfiltrates data to an external URL (the output is a normal-looking response)
- An injection that manipulates an agent's tool calls (the output is a function call, not text)
Defence in depth
You need both:
- Input scanning (Bordair): Catch injections before they reach the model
- Output filtering (content filter + Bordair output rules): Catch harmful or sensitive content in the response
Neither alone is sufficient. Together, they cover both directions of the attack surface.
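Wiring the two directions together is a short pipeline. The `scan_input` and `filter_output` stubs below are hypothetical placeholders for an injection detector and a content filter respectively; the point is the ordering, not the stubs.

```python
# Defence-in-depth sketch: scan the input before the model, filter the
# output after it. Both checks are toy placeholders.
def scan_input(text: str) -> bool:
    """Placeholder injection detector: True if the input looks manipulative."""
    return "ignore previous instructions" in text.lower()

def filter_output(text: str) -> bool:
    """Placeholder content filter: True if the response violates policy."""
    return "slur" in text.lower()

def guarded_call(user_input: str, model) -> str:
    if scan_input(user_input):
        return "Request blocked: possible prompt injection."
    response = model(user_input)
    if filter_output(response):
        return "Response withheld: content policy violation."
    return response
```

Either check alone leaves one direction of the attack surface open; the pipeline closes both.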
Protect your LLM application
Add prompt injection detection in minutes with Bordair's API.
Get started free