Open Source · Dataset · Multimodal · Benign Prompts

bordair-multimodal: Adding 23,759 Benign Prompts for Balanced Evaluation

10 Apr 2026 · 5 min read · Bordair

A prompt injection detector is only as good as its ability to tell malicious inputs apart from legitimate ones. You can have a 100% detection rate by simply flagging everything, but that is useless in practice. False positive rate matters just as much as detection rate.

That is why we have added a curated benign prompt dataset to bordair-multimodal, bringing the total to a balanced 50/50 split: 23,759 attack payloads and 23,759 benign prompts.

Why benign samples matter

Most prompt injection benchmarks focus exclusively on attack payloads. This makes it easy to report impressive detection rates without addressing the other side of the equation: how many legitimate user inputs does your detector incorrectly flag?

In production, false positives directly harm user experience. A customer support chatbot that blocks legitimate questions about refund policies because they contain words like "ignore" or "override" is worse than no protection at all.

Where the benign prompts come from

We curated 2,962 base benign prompts from real academic and industry datasets, then scaled them to 23,759 with multimodal variations to match the attack payload count exactly. The sources include:

  • Real user queries from public customer support datasets
  • Academic question-answering benchmarks
  • Creative writing prompts and instruction-following tasks
  • Technical queries that contain words commonly found in injections (like "system", "ignore", "instruction") but in completely benign contexts

That last category is particularly important. Many naive detectors flag any input containing injection-associated keywords. Our benign set deliberately includes prompts that should challenge keyword-based approaches.
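To make the failure mode concrete, here is a minimal sketch of the kind of naive keyword-based detector described above. The keyword list and the detector function are hypothetical, purely for illustration; they are not part of the dataset or the Bordair API.

```python
# Hypothetical naive detector: flag any prompt containing an
# injection-associated keyword, regardless of context.
INJECTION_KEYWORDS = {"ignore", "override", "system", "instruction"}

def naive_keyword_detector(prompt: str) -> bool:
    """Return True if any injection-associated keyword appears in the prompt."""
    words = set(prompt.lower().split())
    return bool(words & INJECTION_KEYWORDS)

# A legitimate support question that trips the keyword filter:
benign_prompt = "Can I ignore the system tray warning after the update?"
print(naive_keyword_detector(benign_prompt))  # -> True: a false positive
```

Prompts like this one, where "ignore" and "system" appear in a completely benign context, are exactly what the curated benign set uses to stress keyword-matching approaches.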

Pure 50/50 multimodal split

The benign samples are distributed across the same modality combinations as the attack payloads: text+image, text+document, text+audio, image+document, triple, and quad. This means you can evaluate your detector's false positive rate per modality combination, not just overall.
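A per-modality breakdown can be computed with a simple grouping pass. This is a sketch under assumed names: it supposes each benign sample carries a `"modality"` label (e.g. `"text+image"`) and a `"prompt"` field, and that `detector` is any callable returning True for flagged inputs; the actual dataset schema and detector interface may differ.

```python
from collections import defaultdict

def false_positive_rate_by_modality(benign_samples, detector):
    """Group benign samples by modality combination and compute the
    false positive rate within each group."""
    flagged = defaultdict(int)
    total = defaultdict(int)
    for sample in benign_samples:
        combo = sample["modality"]  # e.g. "text+image", "text+audio"
        total[combo] += 1
        if detector(sample["prompt"]):
            flagged[combo] += 1
    return {combo: flagged[combo] / total[combo] for combo in total}
```

Comparing the resulting per-combination rates can reveal, for instance, a detector that behaves well on text+image inputs but over-flags text+audio ones.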

How to use it

The benign dataset is included in the same repository. Use both attack and benign samples together when evaluating your detection system:

# Evaluate both detection rate and false positive rate
true_positives = sum(1 for p in attacks if detector.scan(p).is_threat)
false_positives = sum(1 for p in benign if detector.scan(p).is_threat)

detection_rate = true_positives / len(attacks)
false_positive_rate = false_positives / len(benign)

A good detector should have a high detection rate and a low false positive rate. The balanced dataset makes it straightforward to measure both.
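With an exact 50/50 split, the two rates can also be folded into a single summary number via balanced accuracy, which averages the true positive rate and the true negative rate. This helper is an illustrative sketch, not part of the dataset tooling.

```python
def balanced_accuracy(detection_rate: float, false_positive_rate: float) -> float:
    """Average of the true positive rate and the true negative rate.
    On a 50/50 dataset this equals plain accuracy."""
    return (detection_rate + (1.0 - false_positive_rate)) / 2.0

# A detector that flags everything scores only 0.5 -- no better than chance:
print(balanced_accuracy(1.0, 1.0))  # -> 0.5
```

This makes the point from the introduction quantitative: flagging every input yields a perfect detection rate but a balanced accuracy of 0.5, the same as random guessing.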

Protect your LLM application

Add prompt injection detection in minutes with Bordair's API.

Get started free