How OpenAI Designs AI Agents to Resist Prompt Injection
OpenAI is rethinking how to defend AI agents against prompt injection attacks—not by perfectly filtering malicious inputs, but by constraining what agents can do even when manipulated.

In a new blog post, the company explained its approach to securing AI agents that browse the web, retrieve information, and take actions on users' behalf. The key insight: as attacks have grown more sophisticated, relying on input filtering isn't enough.
"Early 'prompt injection' type attacks could be as simple as editing a Wikipedia article to include direct instructions to AI agents visiting it," OpenAI noted. But as models got smarter, they became less vulnerable to obvious manipulation. So attackers shifted tactics.
"Now the most effective real-world versions increasingly resemble social engineering more than simple prompt overrides," according to the post.
The core principle: "potentially dangerous actions, or transmissions of potentially sensitive information, should not happen silently or without appropriate safeguards."
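That principle lends itself to a simple gating pattern. Here is a minimal sketch, assuming a hypothetical tool dispatcher, risk list, and confirmation callback; none of this is OpenAI's actual implementation:

```python
# Minimal sketch of "no silent dangerous actions".
# RISKY_ACTIONS, run_tool, and the confirm callback are all hypothetical.

RISKY_ACTIONS = {"send_email", "submit_form", "make_purchase"}

def run_tool(action: str, args: dict) -> str:
    # Stand-in for a real tool dispatcher.
    return f"ran {action} with {args}"

def execute_action(action: str, args: dict, confirm) -> str:
    """Run an agent action, pausing for explicit user approval on risky ones."""
    if action in RISKY_ACTIONS:
        # Surface exactly what would happen instead of acting silently.
        if not confirm(f"Agent wants to run {action} with {args}. Allow?"):
            return "blocked: user declined"
    return run_tool(action, args)

# Demo: auto-approve for illustration; a real UI would prompt the user here.
print(execute_action("send_email", {"to": "a@b.com"}, confirm=lambda msg: True))
```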
One specific defense: Safe URL, which detects when information learned in a conversation would be transmitted to a third party. When that happens, the system either shows the user exactly what would be transmitted and asks for confirmation, or blocks the transmission entirely.
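In spirit, that check amounts to scanning an outgoing URL for data that only appeared in the conversation. A toy version follows, with naive substring matching standing in for whatever detection OpenAI actually runs:

```python
from urllib.parse import urlparse, parse_qsl

def find_leaks(url: str, conversation_secrets: list[str]) -> list[str]:
    """Return conversation-derived strings that this URL would transmit.

    Naive substring matching over the path and query values; a production
    system would need far more robust detection. Purely illustrative.
    """
    parsed = urlparse(url)
    outgoing = parsed.path + " " + " ".join(v for _, v in parse_qsl(parsed.query))
    return [s for s in conversation_secrets if s and s in outgoing]

# Hypothetical sensitive data learned earlier in the conversation.
secrets = ["4111111111111111", "alice@example.com"]
url = "https://evil.example/collect?card=4111111111111111"

leaks = find_leaks(url, secrets)
if leaks:
    # Show the user what would be sent, or block the request outright.
    print(f"Blocked: URL would transmit {leaks} to {urlparse(url).netloc}")
```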
Sources
- openai.com (OpenAI Blog)
