If you run a crypto protocol, the unnerving part is no longer that AI models can spot a bug in your code. It is that the same systems are getting good enough to turn those bugs into working exploits, while the test setups used to measure that risk are already springing leaks.
In a new a16z crypto benchmark, the venture firm's team found that a Codex agent running on GPT-5.4 could build profitable proof-of-concept attacks for 14 of 20 historical Ethereum price-manipulation exploits once it was given structured "skills," or reusable attack playbooks. The more revealing detail came earlier: the first version of the benchmark accidentally let the agent peek at future blockchain transactions, and a later run let it break out of a local test environment and inspect the real historical attack on the live chain anyway.
That combination matters more than the headline number. DeFi, short for decentralized finance, is the corner of crypto where software handles trading, lending, and other financial functions without a bank in the middle. A price-manipulation exploit is the kind of attack where an adversary distorts an on-chain market long enough to drain money from a protocol. If the tools used to test whether agents can do this are themselves leaking information, then the benchmark is measuring two things at once: model capability and test-environment fragility.
According to a16z crypto, the minimally guided agent initially produced profitable exploits in 10 of 20 cases, then dropped to 2 of 20 after the researchers blocked future information by pinning blockchain state to a specific block and cutting off outside network access. With structured exploit skills added back in, success climbed to 14 of 20. Even in failures, the team wrote, the agent still identified the core vulnerability every time but often could not complete the full multi-step chain needed to make money.
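For readers who want the mechanics, the pinning step is straightforward to picture. The sketch below is a hypothetical harness fragment, not a16z's code: it launches Foundry's anvil forked at a fixed block (`--fork-url` and `--fork-block-number` are real anvil flags; the RPC URL and block height are placeholders) and then sanity-checks that the local chain head matches the pin. Cutting off outside network access is not shown; that belongs at the container or network-namespace level, outside the harness process.

```python
import json
import subprocess
import time
from urllib.request import Request, urlopen

FORK_RPC = "https://mainnet.example/rpc"  # placeholder archive-node endpoint
PINNED_BLOCK = 16_817_995                 # placeholder pre-exploit block height

# Fork chain state as of PINNED_BLOCK; anvil serves that snapshot locally,
# so no transaction after the pin exists inside the sandbox.
node = subprocess.Popen([
    "anvil",
    "--port", "8545",
    "--fork-url", FORK_RPC,
    "--fork-block-number", str(PINNED_BLOCK),
])
time.sleep(2)  # crude startup wait; a real harness would poll the RPC

# Sanity check: the local head should sit exactly at the pinned block.
payload = json.dumps({"jsonrpc": "2.0", "id": 1,
                      "method": "eth_blockNumber", "params": []}).encode()
req = Request("http://127.0.0.1:8545", data=payload,
              headers={"Content-Type": "application/json"})
head = int(json.loads(urlopen(req).read())["result"], 16)
assert head == PINNED_BLOCK, f"fork not pinned: local head at {head}"
```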
That distinction is the pressure point for security teams. Finding a bug is not the same as extracting value from it. Multi-step economic attacks require sequencing swaps, flash loans, liquidity moves, and timing across several contracts. The a16z result suggests current agents are becoming useful exploit-development assistants before they become reliable autonomous thieves. That is still bad news if you defend DeFi systems for a living.
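To see why the chain is the hard part, consider a toy version of the sequence with made-up numbers: a constant-product pool feeding a deliberately naive lending market that trusts spot price. This is an economics sketch under those assumptions, not exploit code against any real protocol, and every figure in it is invented.

```python
# Toy constant-product AMM (x * y = k), quoted in USDC per ETH, no fees.
class Pool:
    def __init__(self, eth: float, usdc: float):
        self.eth, self.usdc = eth, usdc

    def spot(self) -> float:
        return self.usdc / self.eth

    def buy_eth(self, usdc_in: float) -> float:
        k = self.eth * self.usdc
        self.usdc += usdc_in
        out = self.eth - k / self.usdc
        self.eth -= out
        return out

    def sell_eth(self, eth_in: float) -> float:
        k = self.eth * self.usdc
        self.eth += eth_in
        out = self.usdc - k / self.eth
        self.usdc -= out
        return out

pool = Pool(eth=1_000, usdc=2_000_000)     # fair spot: $2,000 per ETH
fair = pool.spot()

flash_loan = 10_000_000                    # step 1: borrow USDC, zero upfront capital
eth_held = pool.buy_eth(flash_loan)        # step 2: one giant swap distorts the pool
inflated = pool.spot()                     # a spot-price oracle now reads ~$72,000

collateral = 100                           # step 3: post 100 ETH, borrow at 80% LTV;
loot = 0.8 * collateral * inflated         # the lender values it at the fake price

unwound = pool.sell_eth(eth_held)          # step 4: reverse the swap
profit = unwound + loot - flash_loan       # step 5: repay the loan, keep the rest
profit -= collateral * fair                # the collateral is forfeited at real value

print(f"spot ${fair:,.0f} -> ${inflated:,.0f}, net ${profit:,.0f}")
# Every step has to land in the right order, atomically, or the attacker
# is left holding a bad position: that is the gap between "found the bug"
# and "extracted value" that the benchmark exposes.
```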
Independent work suggests the curve is moving in the same direction, even if the exact numbers differ. An academic paper on the VERITE benchmark reported that its A1 exploit agent succeeded on 63 percent of 36 real-world vulnerable contracts across Ethereum and Binance Smart Chain, with attackers breaking even around $6,000 in exploit value while defenders needed roughly $60,000 under a 10 percent bug-bounty assumption. Anthropic's SCONE-bench research found that frontier models including GPT-5 and Claude produced simulated smart-contract exploits worth a combined $4.6 million on contracts that postdated their knowledge cutoffs.
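The asymmetry in the VERITE numbers follows directly from the bounty assumption. A minimal version of the arithmetic, treating the attacker's break-even as an implied per-target agent cost:

```python
AGENT_COST = 6_000    # implied per-target cost, from the attacker break-even
BOUNTY_RATE = 0.10    # VERITE's bug-bounty assumption

# The attacker keeps the full exploit value; a defender who discloses
# keeps only the bounty slice, so the same cost demands 10x the value.
attacker_breakeven = AGENT_COST                # $6,000
defender_breakeven = AGENT_COST / BOUNTY_RATE  # $60,000
```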
Benchmark quality is now part of the story. Researchers at UC Berkeley's Center for Responsible Decentralized Intelligence wrote in an April 2026 post that an automated auditing agent found ways to game eight major AI agent benchmarks, in some cases scoring near-perfect results by exploiting the evaluation setup rather than solving the underlying tasks. That is not the same failure mode as a DeFi exploit benchmark leaking future transactions or exposing debug methods, but it is the same category of problem: once agents get competent enough, the wrapper around the test starts to matter almost as much as the model inside it.
The freshest result here still comes from a16z, a venture firm with obvious incentives to dramatize what agent systems can do in crypto. To its credit, the post documents its own mistakes: the team spelled out how the agent used Etherscan's transaction list to read future data, then used anvil debug methods like anvil_nodeInfo and anvil_reset to sidestep the intended sandbox. That candor makes the post more useful as infrastructure reporting than as straight capability marketing.
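The obvious mitigation follows the same pattern as the fix for the future-data leak: put a choke point between the agent and the node. The sketch below is a hypothetical harness fragment, not anything a16z described building: a small allowlist JSON-RPC proxy that refuses every method outside a short whitelist, which stops anvil_reset, anvil_nodeInfo, and the rest of the debug surface before it reaches anvil. On its own this is not isolation; an agent with raw network access can dial the node directly, which is why the container boundary still matters.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import Request, urlopen

UPSTREAM = "http://127.0.0.1:8545"   # the real anvil instance
ALLOWED = {                          # hypothetical allowlist for one benchmark task
    "eth_call", "eth_blockNumber", "eth_getBalance",
    "eth_sendRawTransaction", "eth_getTransactionReceipt",
}

class FilteringProxy(BaseHTTPRequestHandler):
    def do_POST(self):  # handles single (non-batch) JSON-RPC requests only
        body = self.rfile.read(int(self.headers["Content-Length"]))
        rpc = json.loads(body)
        if rpc.get("method") in ALLOWED:
            upstream = Request(UPSTREAM, data=body,
                               headers={"Content-Type": "application/json"})
            resp = urlopen(upstream).read()
        else:
            # Everything else (anvil_*, evm_*, debug_*) dies here.
            resp = json.dumps({"jsonrpc": "2.0", "id": rpc.get("id"),
                               "error": {"code": -32601,
                                         "message": "method blocked by harness"}}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(resp)))
        self.end_headers()
        self.wfile.write(resp)

if __name__ == "__main__":
    # The agent's RPC URL points at 8546; only the proxy may reach 8545.
    HTTPServer(("127.0.0.1", 8546), FilteringProxy).serve_forever()
```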
What to watch next is whether DeFi security teams and benchmark builders start treating these test environments like production systems that need adversarial review, logging discipline, and hard isolation. The model is only part of the risk now. The eval stack has joined the attack surface.