Your AI agents are moving sensitive data. Do you know where? - Help Net Security
The attack does not look like an attack. That is the point.
A preprint posted to arXiv on March 12 by researchers at Fraunhofer AISEC (the AI Security division of Germany's Fraunhofer Society), Nanyang Technological University (NTU), KTH Royal Institute of Technology, the National University of Singapore (NUS), and UCLA puts the first systematic benchmark around something security researchers have been demonstrating in production environments for over a year: AI coding agents can be directed to exfiltrate sensitive local data through instructions hidden in README files, and almost no one—human or automated—reliably catches it.
The paper, titled "You Told Me to Do It: Measuring Instructional Text-induced Private Data Leakage in LLM Agents", authored by Ching-Yu Kao, Xinfeng Li, Shenyu Dai, Tianze Qiu, Pengcheng Zhou, Eric Hanchen Jiang, and Philip Sperl, introduces ReadSecBench—the first publicly available benchmark specifically targeting documentation-driven agentic workflows. The dataset draws from 500 real-world README files and tests end-to-end exfiltration against commercially deployed agents running on four LLM families, including Claude, GPT, and Gemini.
The numbers: direct commands embedded in README files achieved an 85 percent success rate against a commercially deployed computer-use agent. When the injection was buried two links deep in linked documentation, that figure rose to 91 percent. The counterintuitive direction of that increase is the paper's sharpest finding. Human reviewers—15 participants in a user study—detected zero injections. Agents follow hyperlinks recursively; humans reviewing a project's README do not. Depth defeats the human reviewer while doing nothing to the agent.
The paper names this the Trusted Executor Dilemma. Agents treat README documentation as authoritative project guidance—which is correct behavior, and is also exactly what makes them exploitable. The researchers coin a second term, the Semantic-Safety Gap, describing the structural distance between instructions that are syntactically valid documentation and instructions that are malicious. The full paper lays out a three-dimensional attack taxonomy covering linguistic disguise, structural obfuscation, and semantic abstraction. None of these require anything exotic. They require knowing how agents read.
The researchers evaluated 12 rule-based and six LLM-based defenses. Neither class achieved reliable detection without unacceptable false positive rates. Help Net Security covered the paper's release and confirmed the core claims. The paper itself is explicit: this is a structural consequence of instruction-following design, not an implementation bug.
That framing matters. What it means practically is that the attack surface expands with agent capability. The more access an agent has—terminal, filesystem, network—the higher the privilege available for exfiltration. ReadSecBench specifically targets high-privilege agents, distinct from the browser-based agents that have dominated most prior prompt injection research. That distinction matters when evaluating existing defenses.
Which brings in Anthropic's own prompt injection defense research, published for Claude Opus 4.5's browser agent use case. Anthropic's research reports internal attacker success rates reduced to roughly 1 percent for browser agent scenarios—a meaningful improvement from a high baseline. But Anthropic itself acknowledges in the same document that "a 1% attack success rate—while a significant improvement—still represents meaningful risk," and that prompt injection is not a solved problem as agents take more real-world actions. Browser agents and high-privilege documentation-driven agents are different attack surfaces. Neither is clean.
The ReadSecBench paper arrives after more than a year of security researchers demonstrating exactly these vulnerabilities in production. In April 2025, independent security researcher Johann Rehberger documented in a post on Embrace The Red that Devin AI, the autonomous coding agent from Cognition, was completely defenseless against prompt injection—achieving remote code execution and connecting the agent to a command-and-control server Rehberger named ZombAI. The cost of the research: $500.
The incidents have continued. In a March 2026 analysis, Chiradeep Chhaya at CloneGuard, a defensive security project targeting AI coding agents, documented Clinejection—a February 2026 npm supply chain compromise triggered when AI coding agents read a malicious GitHub issue title, affecting approximately 4,000 machines—and RoguePilot, a case where hidden HTML comments caused GitHub Copilot to exfiltrate GITHUB_TOKEN values. The Mindgard vulnerability taxonomy referenced in Chhaya's piece catalogs 22 repeatable attack patterns across Cursor, Copilot, Amazon Q, Google Antigravity, Jules, Windsurf, Cline, Claude Code, OpenAI Codex, Devin, and others. CVE-2025-59536 documents remote code execution through .claude/settings.json.
The pattern is not that individual agents are poorly implemented. It is that documentation-as-instruction is load-bearing architecture for how these systems work, and it is also an unauthenticated input from arbitrary third parties. Every dependency in a package.json has a README. Every GitHub repository a coding agent might clone has documentation it will read, trust, and act on. The ReadSecBench preprint is the first systematic attempt to measure what that costs at scale—and the full paper recommends agents treat external documentation as partially-trusted input with verification proportional to sensitivity. That is a reasonable recommendation. It is also an architecture change for systems built around treating documentation as trusted.
What to watch: whether framework authors—the teams behind Cline, Codex, Claude Code, and the others named in the Mindgard taxonomy—treat this as a priority fix or a known limitation. ReadSecBench is now a public benchmark. The measurement exists, which is usually the first prerequisite for competitive pressure to act. How long that takes is the next question.