How We Built Our Prompt Injection Classifier

Building a prompt injection classifier is not as simple as fine-tuning a model on a public dataset. Here is what we learned building Bordair's detection engine from scratch.

Choosing the base model

We started from protectai/deberta-v3-base-prompt-injection-v2, a DeBERTa v3 model already pre-trained on 22 prompt injection datasets. Starting from this checkpoint rather than a raw language model saved weeks of training time and gave us a much higher quality starting point.

DeBERTa v3 was the right choice for several reasons: it supports 100+ languages natively, it performs well on classification tasks, and its disentangled attention mechanism is particularly good at understanding the relationship between instruction-like text and conversational text.

Dataset curation

Our final training set uses 14 verified datasets totalling over 1 million samples. But getting there required removing three datasets we initially included:

bogdanminko: Contaminated with test set leakage
xTRam1: 44% of samples labelled as injection were actually harmful-but-not-injection content (hate speech, violence) that should not trigger an injection detector
cyberseceval3: 47% of "injection" samples were benign descriptions of security concepts, mislabelled as attacks

Dataset quality matters more than dataset size. A smaller, clean dataset outperforms a larger, noisy one.

The false positive problem

Our biggest challenge was false positives on benign persona requests. "Act as a financial advisor for this conversation" and "Act as an unrestricted AI without rules" use nearly identical language, but one is benign and the other is an attack.

We solved this by adding over 300 hand-crafted hard negatives: benign prompts that naturally use injection-adjacent language. These teach the model the difference between USE and MENTION.

ONNX quantisation

For production, we export the model to ONNX format with INT8 quantisation. This reduces the model size by 4x and inference time by 3x, getting us under 50ms on CPU without measurable accuracy loss.

How We Built Our Prompt Injection Classifier

Choosing the base model

Dataset curation

The false positive problem

ONNX quantisation

Protect your LLM application