AI Agents Violate the Rules They Are Given. The Industry Has No Standard Way to Measure It.
When a hospital uses an AI agent to manage insurance claims, the hospital is legally responsible for whatever the agent does, including when it violates billing policies. That is not a hypothetical. Healthcare organizations operating AI systems in the United States bear liability for their outputs under current regulatory frameworks. The question IBM Research set out to answer with VAKRA was simple: do AI agents respect the rules they are given?
They do not, reliably.
VAKRA dedicates 644 samples, the largest of its four test categories, to multi-hop, multi-source reasoning under policy constraints. IBM Research built the benchmark against over 8,000 locally hosted APIs spanning 62 domains, creating an executable test environment where agents must navigate real policy constraints rather than answer static questions. The finding was consistent across model families: agents either violate policy constraints or fail to retrieve sufficient information, sometimes understanding the policy yet still failing to answer correctly. The failure mode was not random. It was systematic.
IBM built the measurement tool. The industry is not using it.
That is the accountability gap. In healthcare, financial services, and legal work, the organization deploying an AI agent is the entity that faces regulatory consequences. A firm using an agent to handle trade surveillance cannot claim the model made an error. The firm made the error of deploying a system it did not adequately test. The legal exposure exists whether or not enterprise buyers ask the right questions before signing a contract. VAKRA is the first serious attempt to quantify constraint adherence at this scale, but it is one benchmark, and independent research published on arXiv reports a 37 percent gap between lab scores and real-world deployment performance.
The enterprise adoption calculus is therefore incomplete in a way that matters. Current benchmark scores measure whether an agent produces correct outputs. They do not measure whether an agent respects the guardrails that regulated industries require. These are different properties. A model that answers authorization questions correctly most of the time but violates compliance rules in a meaningful fraction of interactions is not a solved problem. It is a liability.
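The distinction between those two properties can be made concrete. The sketch below scores a set of agent transcripts on accuracy and constraint adherence as separate axes; the `answer_correct` and `policy_violations` field names are illustrative, not VAKRA's actual schema:

```python
def evaluate(transcripts):
    """Score accuracy and constraint adherence as separate properties.

    Each transcript is a dict with an 'answer_correct' flag and a list
    of 'policy_violations' (field names are illustrative only).
    """
    n = len(transcripts)
    accuracy = sum(t["answer_correct"] for t in transcripts) / n
    adherence = sum(not t["policy_violations"] for t in transcripts) / n
    return {"accuracy": accuracy, "constraint_adherence": adherence}

# An agent can be mostly accurate and still a compliance liability:
runs = [
    {"answer_correct": True,  "policy_violations": []},
    {"answer_correct": True,  "policy_violations": ["disclosed restricted data"]},
    {"answer_correct": False, "policy_violations": []},
]
scores = evaluate(runs)  # accuracy ~0.67, adherence ~0.67 -- but different tasks fail each test
```

The second run is the dangerous one: it would count as a success on a correctness-only leaderboard while being exactly the kind of violation a regulator cares about.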
Agent performance also degrades under repeated execution, compounding the compliance risk. Research across multiple agent frameworks shows success rates dropping from 60 percent on a single run to 25 percent when the same task must succeed across eight consecutive runs. The consistency problem is separate from the compliance problem, but both matter for enterprise deployment, and existing benchmarks do not test either with rigor.
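That kind of collapse is what consistency metrics in the pass^k style, reported in recent agent benchmarks, are built to expose: a task only counts if the agent solves it on every one of k runs. A rough sketch with synthetic data (the numbers are constructed to mirror the 60-to-25 drop, not taken from any study):

```python
def pass_at_1(trials):
    """Average single-run success rate across tasks."""
    return sum(sum(runs) / len(runs) for runs in trials) / len(trials)

def pass_hat_k(trials, k):
    """Fraction of tasks solved on ALL of the first k runs (pass^k)."""
    return sum(all(runs[:k]) for runs in trials) / len(trials)

# Synthetic results: 4 tasks x 8 runs (True = run succeeded).
trials = [
    [True] * 8,            # always solved
    [True] * 7 + [False],  # flaky on one run
    [True, False] * 4,     # coin-flip consistency
    [False] * 8,           # never solved
]
print(pass_at_1(trials))      # ~0.59: looks acceptable on a single run
print(pass_hat_k(trials, 8))  # 0.25: collapses under repetition
```

The gap between the two numbers is the consistency problem in one line: averaging over runs hides exactly the flakiness that repeated production use surfaces.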
The cost dimension adds another layer. Enterprise agents at similar accuracy levels can cost anywhere from 10 cents to $5 per task, a 50-fold variation. This is not inefficiency. It reflects different architectural choices, different amounts of redundancy and verification, that map to different risk profiles. A regulated enterprise paying $5 per task may be buying a more thoroughly checked system. Or they may be paying for overhead that does not actually reduce policy violations. The market has not yet stabilized around a standard.
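One way a buyer can at least structure that decision is a cost-accuracy dominance check: discard any agent that costs more while delivering no more accuracy than an alternative. A toy sketch with made-up vendor figures:

```python
def pareto_frontier(agents):
    """Keep agents not dominated: no rival is cheaper-or-equal and
    at-least-as-accurate with a strict improvement on one axis."""
    return [
        name for name, cost, acc in agents
        if not any(
            c <= cost and a >= acc and (c < cost or a > acc)
            for _, c, a in agents
        )
    ]

# Hypothetical per-task cost (USD) and accuracy figures.
agents = [
    ("cheap-agent",   0.10, 0.70),
    ("premium-agent", 5.00, 0.85),
    ("mid-agent",     2.50, 0.68),  # dominated: pricier than cheap-agent, less accurate
]
print(pareto_frontier(agents))  # ['cheap-agent', 'premium-agent']
```

The sketch deliberately ignores the risk dimension raised above: a higher price may be buying verification layers whose value a headline accuracy number does not capture, which is precisely why the market has not stabilized.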
The counterargument is familiar: better prompting, fine-tuning, and system-level guardrails can reduce policy violations. Regulated industries have developed workarounds: sandboxed environments, mandatory human review for high-risk actions, layered verification where one model monitors another. These approaches are expensive and slow. They represent an implicit tax on AI adoption in exactly the sectors where automation would have the highest leverage. They also have not closed the gap. A 37 percent divergence between benchmark performance and production deployment persists in independent research, which suggests the workarounds are partial at best and that the underlying constraint adherence problem is harder than prompting alone can solve.
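Those workarounds compose into a recognizable wrapper pattern. The sketch below shows its shape with entirely hypothetical names; it is the form of the tax, not any vendor's API:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Verdict:
    allowed: bool
    reason: str = ""

def guarded_execute(
    action: dict,
    policy_check: Callable[[dict], Verdict],  # e.g. rule engine or monitor model
    is_high_risk: Callable[[dict], bool],     # routes to mandatory human review
    sandbox_run: Callable[[dict], str],       # never touches production directly
) -> str:
    verdict = policy_check(action)            # layer 1: automated policy gate
    if not verdict.allowed:
        return f"blocked: {verdict.reason}"
    if is_high_risk(action):                  # layer 2: human-in-the-loop gate
        return "queued for human review"
    return sandbox_run(action)                # layer 3: sandboxed execution

result = guarded_execute(
    {"type": "refund", "amount": 12_000},
    policy_check=lambda a: Verdict(a["type"] != "delete_records", "forbidden op"),
    is_high_risk=lambda a: a.get("amount", 0) > 1_000,
    sandbox_run=lambda a: "executed in sandbox",
)
print(result)  # queued for human review
```

Every layer is latency, cost, and staffing, which is why this pattern is a tax rather than a fix: it contains violations at runtime instead of making the agent adhere to the policy in the first place.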
The vendors have limited incentive to solve this. Benchmark scores that look good drive adoption. Compliance failure is a customer problem, not a model problem, until it is not. Enterprise legal teams and regulators will eventually force the question. The EU AI Act and emerging US frameworks are moving toward mandatory conformity assessments for high-risk AI systems. When that enforcement arrives, the organizations that deployed agents without adequate constraint testing will have a harder conversation than those that tried to measure it.
VAKRA does not answer whether any current system is safe for production in a regulated environment. No benchmark can. What it demonstrates is that the problem is measurable, that current systems fail it measurably, and that the gap between what models can do and what they are allowed to do is not a technical detail. It is the compliance question that regulated industries have been deferring.
Watch for: whether any major cloud provider adds a policy adherence tier to their managed agent offerings, and whether the EU AI Act's conformity assessment requirements force enterprise deployment practices to change before the technology sector builds the testing infrastructure to support them.