When an AI system stops answering questions and starts calling tools, scheduling meetings, or moving money, the question stops being "did it hallucinate" and becomes "did someone trick it into doing the wrong thing." A new preprint called RIFT-Bench takes aim at that harder question with a methodology for stress-testing the security of autonomous AI agents, and runs it across 45 different agent implementations.
Red-teaming, in plain terms, is adversarial probing: deliberately trying to break a system the way an attacker would, so defenders can fix the holes first. The catch with "agentic" AI, meaning systems that pursue a goal by chaining together tool calls, API requests, and decisions on their own, is that there is no shared way to do that probing across the wildly different ways these systems are wired. A test that breaks one agent may be meaningless against another. RIFT-Bench's bet is that the right abstraction is a graph.
The preprint from the team that built the framework describes a two-phase pipeline. The first phase, Discovery, takes a target agentic system and extracts a hierarchical map of how it is put together: which model, which tools, which prompts, which decision points, which data it can touch. The authors call that map a NodeSpec. The point is not to read the source code. The point is to turn a black-box autonomous system into a testable artifact without needing the vendor's internals.
The second phase, Scanning, takes that graph and runs adaptive adversarial probes against it. Instead of a fixed checklist of attacks, the probes can change based on what the graph reveals. The team ran the pipeline against 45 agentic systems, used 105 adversarial probes, and generated more than 10,000 distinct attack tests. That number is meant to demonstrate generality, that the same machinery works on different architectures, not just one team's setup.
This is where the work gets interesting beyond the headline figure. The methodology is not just a vulnerability finder. It also evaluates mitigations: the same harness can compare how well a defense holds up under the same probe suite. That framing matters, because the alternative is a long string of one-off disclosures, each tied to a specific system, with no way to compare the fixes. The team has already released a companion dataset called agentic-rag-redteam-bench on Hugging Face, which applies the methodology to the agentic retrieval-augmented generation setting and is meant to give builders a concrete starting point.
The timing is not accidental. The agent ecosystem is fragmenting fast into overlapping interoperability layers, including OpenAI's tool-calling, Anthropic's Model Context Protocol (MCP), and Google's Agent-to-Agent (A2A), each a different way for agents to talk to tools and to each other. A shared evaluation layer for security is the obvious missing piece, and the preprint is explicit that RIFT-Bench sits in that gap. Whether it becomes the standard is a separate question from whether the gap exists.
A few limits are worth naming, because the paper itself does not. It is a preprint, not peer-reviewed, and the claims of "effective generalization" across 45 "heterogeneous" systems depend on how comparable those systems actually are, a point the authors do not fully settle. Coverage is broad, not exhaustive: 45 implementations is a sample, not the deployed agentic landscape. The license on the preprint is CC BY-NC-SA 4.0, which constrains commercial reuse, a relevant detail if any derivative product is planned. And the benchmark is a snapshot. Agents, defenses, and attack techniques all change, so a result from this round will not stay current without an active maintenance path.
What to watch next is whether independent security teams and agent vendors actually run the benchmark on their own systems, and whether the NodeSpec representation holds up against the more exotic agent designs now being shipped: agents that plan over long horizons, agents that call other agents, agents that carry state across sessions. The graph abstraction is the load-bearing claim. If it captures the real attack surface, the benchmark has a future. If it doesn't, the 45-system number is just a more elaborate way of testing the same kinds of agents in the same kinds of ways.