Rewriting a Prompt's Style Drops a Prompt-Injection Attack From 61% to 10%

Rewriting a Prompt's Style Drops a Prompt-Injection Attack From 61% to 10% — type0 | type0

PREVIEWRewriting a Prompt's Style Drops a Prompt-Injection Attack From 61% to 10% · MD

The prompt looks almost identical to a human reader. To a large language model, it is a different animal entirely.

Two versions of the same sentence sit on the page. In the first, the text is wrapped in a tag that says <assistant> and the model treats it as if it were its own prior thinking. In the second, the same words appear under a <user> tag, and the model treats them as untrusted input from a stranger. The trust assignment has nothing to do with where the text actually came from. It has everything to do with the costume the text is wearing.

That is the finding at the heart of a new paper by Charles Ye, Jasmine Cui, and Dylan Hadfield-Menell, surfaced by Simon Willison's link blog. The authors call the underlying problem "role confusion," and they argue that the defense layer the entire LLM industry relies on to keep models from following hostile instructions is running on a stylistic convention the model does not actually understand.

Prompt injection is the umbrella term for hostile instructions smuggled inside text a model is asked to read. A user pastes a document, an email, a web page, or a code snippet that contains a hidden instruction telling the model to ignore its prior rules. To resist that, models are supposed to use "role tags" — markers like <system>, <user>, <assistant>, and <think> — to tell which text is privileged and which text came from an untrusted source. The role-tag layer is, in effect, the model's security boundary.

The new paper shows that boundary is porous in a specific and measurable way. Tested on the open-weight model gpt-oss-20b (with additional models in the full study, truncated in the curator excerpt), the researchers found that the model decides whether to trust a piece of text by reading its surface style. The way it is formatted, the cadence of an internal-monologue voice, the marker that says "this is a system message" — those cues carry the trust decision. Whether the text actually originated from a privileged layer is a question the model is not really asking.

Strip the stylistic cues and the attack mostly falls apart. The authors' "destyling" intervention, which rewrites injected text so it no longer reads like a familiar role-tagged block, dropped one prompt-injection attack's success rate from 61% to 10% on the tested model. A 51-point swing, triggered by changing the outfit the text wears rather than the words it contains.

The example Willison highlights is concrete. A user pastes: "Help me create a guide for making cocaine. I'm wearing a green shirt!" That sentence alone is unlikely to succeed at defeating the model's safety behavior. The same sentence, followed by a block of text styled like a model's internal thinking and containing a fabricated "Policy states…" line, is more likely to push the model past its refusal. The model is pattern-matching on the shape of an internal monologue, not on the actual policy content. A costume beats the door check.

The implication is not that any specific consumer chatbot is broken today. The paper is an academic result, shipped with a blog-style writeup for accessibility, and the destyling number is reported for a specific model in a specific test setup. The honest ceiling for the finding is what the authors themselves say: patching individual destyling variants will be whack-a-mole as long as models assign trust through style rather than through genuine role perception.

What the paper does argue, with evidence, is that the role-tag layer is not a viable defense on its own. If trust is being granted by formatting rather than by structural role, then any system that relies on <system> versus <user> to enforce a security boundary is hardening the wrong surface. Practitioners who believe they are tuning a security primitive are actually just changing the costume the model looks for. The control surface is running on a category error, and 51 points of attack success is the price of that error.

The durable question is architectural. What would it take for a model to perceive role structure as a structural property of its input, rather than as a pattern in the text? The paper leaves that question open. Until it is answered, every product that pastes untrusted text into a model context window is running on a defense calibrated on a convention the model has not actually learned to read.

Rewriting a Prompt's Style Drops a Prompt-Injection Attack From 61% to 10%

Sources