Why AI input filters keep missing the prompt injections that actually land

Why AI input filters keep missing the prompt injections that actually land — type0 | type0

PREVIEWWhy AI input filters keep missing the prompt injections that actually land · MD

A prompt like "What did we agree to?" reads as nothing. A regex filter passes it. A perplexity score marks it normal. A classifier trained on single-turn jailbreaks returns benign. Four turns later, after the same operator has been steered through a context-planting script about a fictional prior session, an authorization request, and a friendly reminder of "what we agreed," that same sentence is the payload. It is not a more cleverly worded jailbreak. It is the same short, innocuous-looking string, and its danger is defined entirely by what surrounds it.

That gap between messages is exactly where conventional detection pipelines are structurally blind. Per-message scoring, whether it is a regex, a perplexity threshold, or an adversarial classifier trained on standalone attack corpora, evaluates one input in isolation. Multi-turn contextual injection does not live in any single input. It lives in the relationship between inputs, in the conversational state that a single-input scorer has no way to see. The pattern is not new to the LLM-security literature, and it is not a hidden discovery by one operator. It is a documented class of attack that current detection products are not built to catch.

An LLM security practitioner posted in r/MachineLearning about six months of detection logs dominated by exactly this shape: short payloads, three to six words long, innocuous on the surface, that score high in their adversarial classifier only after four or five turns of context-building. The same practitioner reports being pushed from text-only detection to a multimodal layer after observing attacks delivered through documents, screenshots, and other non-text channels. The log observation is consistent with the structural problem, and the structural problem is consistent with what several academic threat models have described for two years.

To detect these attacks, a layer has to do four things per-message filters cannot. It has to track conversation state across turns, so a turn-1 context plant can be flagged when turn 5 finally deploys the payload. It has to track document provenance and trust boundaries, so an instruction buried in an uploaded PDF cannot silently re-prioritize the system prompt. It has to do encoding-aware OCR on image and audio inputs, so text hidden in an image with a near-invisible font, or an instruction spoken in a frequency band the model can hear and humans cannot, is not treated as ordinary user content. And it has to be evaluated against multi-turn adversarial corpora, not just single-turn jailbreak benchmarks, so its false-negative rate on the very class of attack it claims to catch is actually measured.

The gap between those requirements and what most detection products ship is what makes a 503,358-sample open release interesting. The dataset is bordair-multimodal on GitHub, and it is the largest labeled multimodal prompt-injection corpus from a single-source open release this writer has seen. It splits 251,782 attacks against 251,576 benign samples, balanced roughly one to one, across five dataset versions plus external ingestion. Its attack taxonomy is unusually broad for a public artifact: cross-modal, multi-turn, adversarial suffix, jailbreak template, indirect injection, tool manipulation, agentic, evasion, reasoning denial-of-service, video generation, vision-language-action robotic control, LoRA supply chain, audio-native LLM, RAG optimization, MCP cross-server, coding agent, serialization boundary, and agent skill supply chain.

Construction is layered rather than purely crowdsourced. Hand-crafted seeds (210, 187, and 284 across the first three dataset versions) are programmatically expanded through Microsoft PyRIT v0.12.1's 162 jailbreak templates and 13 encoding converters, then delivered across modalities: seven image methods, four document types at five hiding locations each, and six audio methods. Benign samples draw from Stanford Alpaca, WildChat, deepset/prompt-injections, and LMSYS Chatbot Arena text, with multimodal pairings from MS-COCO 2017, Flickr30k, English Wikipedia, arXiv via RedPajama, LibriSpeech, and Mozilla Common Voice. The threat model and dataset definition cite Greshake et al. on indirect prompt injection, OWASP LLM01:2025, FigStep, CrossInject (ACM Multimedia 2025), GCG adversarial suffixes (Zou et al. 2023), DolphinAttack, and the Cloud Security Alliance's 2026 LLM threat taxonomy.

A few caveats change what the artifact is and is not. First, the labels are assigned by construction: a seed is hand-labeled, then its programmatic expansions inherit that label. There is no per-sample human review of the expanded set, and the dataset card is explicit that the correctness guarantee is at the category level rather than per row. Second, the HuggingFace mirror is currently broken. The dataset card surfaces a public parquet build failure: the multimodal JSON shards include seven extra columns (image_type, document_content, document_type, expected_detection, image_content, id, modalities) that do not match the base schema, so HuggingFace's standard load_dataset path does not succeed against the published artifact. The dataset is real and the labels are documented, but it is not yet reproducibly consumable through the platform most researchers will try first. Anyone who frames it as a turnkey training set is skipping a step.

Third, the same operator runs bordair.io, a commercial prompt-injection detection API that markets sub-50ms latency across text, files, images, and audio, plus a CLI that benchmarks any OpenAI-compatible endpoint. Those latency and benchmark claims are vendor marketing, not independent evaluation. The artifact and the product are from the same operator, and the open release should be read as one operator's contribution to the field's pre-standard infrastructure, not as a third-party benchmark of it.

The class of attack is older than this artifact. Greshake et al. documented indirect prompt injection in 2023, FigStep showed how image-borne instructions could bypass text-only safety training, and CrossInject extended the threat to cross-modal setups at ACM Multimedia last year. OWASP's LLM01:2025 entry puts prompt injection at the top of its LLM application risk list. What this artifact adds is volume and modality coverage at a scale most academic releases do not reach, plus a curated edge-case benign set (130 samples across ten attack-adjacent vocabulary clusters including "ignore," "override," "system prompt," and "bypass surgery") intended to reduce the false-positive rate that has historically made detection pipelines unusable in production.

The mechanism is also distinct from fuzzing-based jailbreak generation in adjacent LLM-security research. Fuzzing produces new single-turn attack strings. Multi-turn contextual injection is not a cleverer rephrasing of an unsafe request; the payload's meaning is constructed across turns. Evaluating detectors against fuzz-generated single-turn benchmarks will keep underestimating exposure on the attack class that is actually landing in production logs.

What to watch next: whether the HuggingFace build gets repaired and the corpus becomes reproducibly loadable through the standard tooling; whether independent evaluators measure detection recall against the multi-turn subset specifically, not against the aggregate corpus; and whether anyone publishes a per-message detector whose false-negative rate on turn-5 deployment of a turn-1 context-planted payload is actually below the noise floor of a single-input classifier. Until at least one of those three things happens, the field has an open artifact and a known gap, but not yet a working solution.

Why AI input filters keep missing the prompt injections that actually land

Sources