Metadata Injection: The Hidden Attack Surface in Files

Every file format has metadata: data about data. Images have EXIF tags. PDFs have document properties. Word documents have comments, revision history, and author fields. When an LLM processes a file, it often reads this metadata alongside the main content, creating a hidden attack surface.

Image metadata

EXIF data: Camera settings, GPS coordinates, and custom fields. The "ImageDescription" or "UserComment" EXIF tag can contain injection text.
PNG text chunks: PNG files support tEXt, iTXt, and zTXt chunks for arbitrary text metadata.
XMP data: Extensible Metadata Platform data embedded in images, often processed by document analysis tools.

Document metadata

PDF properties: Title, Author, Subject, Keywords fields can contain injection text
DOCX comments: Track-changes comments are often included in extracted text
Hidden columns: Spreadsheets with hidden columns or rows that contain injection payloads
Speaker notes: PowerPoint speaker notes that are extracted but not displayed

Why metadata matters

Many document processing pipelines extract "all text" from a file, which includes metadata. If the pipeline does not distinguish between content and metadata, the injected metadata becomes part of the LLM's context.

How Bordair handles it

Bordair extracts and scans metadata separately from document content. Each metadata field is checked for injection patterns. In our multimodal dataset, metadata-based attacks account for a significant portion of document and image injection payloads.

Metadata Injection: The Hidden Attack Surface in Files

Image metadata

Document metadata

Why metadata matters

How Bordair handles it

Protect your LLM application