From Bolted-On to Built-In: A Five-Checkpoint Pattern for Enterprise AI Agents
Most enterprises can't govern the AI agents they're already running, and IBM Research's demo paper lays out a model-agnostic, policy-as-code path past the gap.
Most large enterprises will be running more than a thousand AI agents by the end of 2026. Most of the executives in charge of those agents do not trust the governance around them. The gap between those two facts is what IBM Research's new demo paper, arXiv preprint 2605.20874, "Governance by Construction for Generalist Agents," is trying to close. Submitted 20 May 2026 and listed for ACM CAIS 2026, the paper argues that enterprise agent failures are not model failures. They are architectural. Governance is bolted on after the fact, and the harness is where the work has to change.
IBM's enterprise survey, presented at Think 2026 and summarized by Beam AI, projects roughly 1,600 AI agents per large organization by year-end. Seventy percent of executives in that dataset call their current AI governance not fit for purpose. Only 18% maintain a complete, current inventory of the agents they are running. Only 12% have a centralized platform to manage the sprawl. Beam AI, paraphrasing the same IBM material, puts concern about agent sprawl at 94% of organizations.
The deeper concern, per Beam AI's summary of IBM IBV data, is integration. Sixty-eight percent of executives worry their AI initiatives will fail for lack of deep integration between the agents, the systems they touch, and the governance layer above them. The worry is not that the model hallucinates. The worry is that the governance layer does not exist, or that it is decoupled from where the agent actually acts. The same dataset reports that organizations using orchestration-led governance were 13 times more likely to be scaling their AI practice and saw 30% fewer irregularities than peers. That is the trade the IBM numbers are pointing at.
The constructive answer, in IBM Research's framing, is governance by construction. The architecture is called CUGA, and the open-source repository describes it as a generalist agent harness for the enterprise, with MCP, OpenAPI, and LangChain tool wiring, multiple reasoning modes, supervisor and multi-agent orchestration, and A2A support. The point of the design is that the model is swappable. The governance is not. The governance is a property of the system the model is plugged into.
CUGA enforces policy at five structural checkpoints. The Intent Guard sits upstream of planning and screens the user's request before the agent starts reasoning. The Playbook is a system-prompt section that steers the agent's in-context reasoning toward the policies the organization has actually approved. The Tool Guide wraps tool calls and restricts which functions the agent can invoke in a given context. Tool Approvals is a human-in-the-loop gate, sitting outside the reasoning loop, that pauses high-risk actions for a human reviewer. The Output Formatter shapes the final response, applying formatting and disclosure rules before anything reaches the user. Each module is documented in the CUGA repository, and each one catches a class of failure that an after-the-fact policy review would only see after damage was done.
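The five checkpoints can be pictured as stages in a single request path. The sketch below is a minimal illustration only, not CUGA's actual API: the `Harness` and `ToolCall` classes, the field names, the keyword-matching intent check, the tool names, and the placeholder patient ID are all hypothetical, chosen to show where each checkpoint would sit relative to planning and tool execution.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ToolCall:
    """A planned tool invocation (hypothetical shape, not CUGA's)."""
    name: str
    args: dict
    high_risk: bool = False


@dataclass
class Harness:
    """Hypothetical harness with one hook per checkpoint."""
    blocked_intents: set[str]            # Intent Guard rules
    playbook: str                        # Playbook text for the system prompt
    allowed_tools: set[str]              # Tool Guide allow-list
    approver: Callable[[ToolCall], bool]  # Tool Approvals (human reviewer)
    disclosure: str                      # Output Formatter disclosure text
    log: list[str] = field(default_factory=list)

    def run(self, request: str,
            planner: Callable[[str, str], list[ToolCall]]) -> str:
        # 1. Intent Guard: screen the request before any planning happens.
        if any(term in request.lower() for term in self.blocked_intents):
            self.log.append("intent_guard:rejected")
            return "Request declined by policy."

        # 2. Playbook: approved policy text is handed to the planner
        #    (standing in for a system-prompt section).
        plan = planner(self.playbook, request)

        executed = []
        for call in plan:
            # 3. Tool Guide: only allow-listed tools may be invoked.
            if call.name not in self.allowed_tools:
                self.log.append(f"tool_guide:blocked:{call.name}")
                continue
            # 4. Tool Approvals: high-risk calls wait on a human decision.
            if call.high_risk and not self.approver(call):
                self.log.append(f"tool_approvals:denied:{call.name}")
                continue
            executed.append(call.name)
            self.log.append(f"executed:{call.name}")

        # 5. Output Formatter: attach disclosure text to the final response.
        summary = ", ".join(executed) if executed else "no actions"
        return f"Completed: {summary}. {self.disclosure}"


def demo_planner(playbook: str, request: str) -> list[ToolCall]:
    # Stand-in for the model: always proposes the same three calls.
    return [
        ToolCall("schedule_procedure", {"patient": "P-001"}),
        ToolCall("update_chart", {"patient": "P-001"}, high_risk=True),
        ToolCall("delete_record", {"patient": "P-001"}),
    ]


harness = Harness(
    blocked_intents={"export all records"},
    playbook="Chart updates require clinician confirmation.",
    allowed_tools={"schedule_procedure", "update_chart"},
    approver=lambda call: True,  # simulates a reviewer who approves
    disclosure="Consent disclosures apply.",
)
result = harness.run("Schedule the procedure and update the chart", demo_planner)
print(result)
print(harness.log)
```

In this toy run the planner proposes a destructive `delete_record` call that the allow-list never sees as permitted, so it is dropped before execution regardless of what the model intended. A real deployment would replace the keyword check, the allow-list, and the lambda approver with whatever policy engine and review workflow the organization actually operates.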
The paper's worked example is a healthcare scenario. A clinician asks the agent to schedule a procedure and update the patient's chart. The agent plans a sequence of tool calls. The Intent Guard flags a request that combines a record update with an action requiring explicit consent, and routes the action sequence into a dynamic playbook that enforces the right approval flow. The Tool Guide blocks a destructive write the playbook did not pre-approve. Tool Approvals pauses the action until the clinician confirms. The Output Formatter rewrites the final message so the patient-facing language matches the clinic's consent disclosures. None of that is model retraining. All of it is harness design.
The model-agnostic claim is the part practitioners should test first. A generalist LLM sits behind the policy layer, and the same architecture is meant to work with a different model tomorrow, or a smaller one today, without rewriting the policies. That is what "governance by construction" means in practice. The governance is not a property of the model. It is a property of the harness.
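One way to test that claim is to hold the policy fixed and swap the model underneath it. The following is a hedged sketch under simplifying assumptions: a "model" is reduced to a function from prompt to proposed tool name, and `large_model`, `small_model`, `ALLOWED_TOOLS`, and `governed_step` are hypothetical names for illustration, not anything defined by the paper or the CUGA repository.

```python
from typing import Callable

# Hypothetical: a model is any callable that maps a prompt to a tool name.
Model = Callable[[str], str]

# Policy lives in the harness, independent of which model is plugged in.
ALLOWED_TOOLS = {"read_record", "schedule_procedure"}


def large_model(prompt: str) -> str:
    # Stand-in for one LLM; here it proposes a destructive action.
    return "delete_record"


def small_model(prompt: str) -> str:
    # Stand-in for a different, smaller LLM.
    return "read_record"


def governed_step(model: Model, request: str) -> str:
    """Apply the same allow-list to whatever the given model proposes."""
    proposed = model(request).strip()
    if proposed not in ALLOWED_TOOLS:
        return f"blocked:{proposed}"
    return f"allowed:{proposed}"


# Same policy, two different models, no policy rewrite in between.
for model in (large_model, small_model):
    print(model.__name__, "->", governed_step(model, "Look up the patient chart"))
```

The design point is that `ALLOWED_TOOLS` and `governed_step` never reference a particular model, so replacing one model with another changes what gets proposed but not what is permitted.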
The architecture is necessary, not sufficient. Policy-as-code is still policy. Someone has to author the playbooks, and authoring playbooks is organizational work, not engineering work. Someone has to staff the human-in-the-loop gates, and "human in the loop" is often a euphemism for an overworked security team with a Slack channel. Someone has to keep the agent inventory current, which means the remaining 82% of executives in the same IBM dataset, those who do not maintain a complete inventory, may be failing at precisely the governance prerequisite the architecture depends on. The 13x scaling and 30% fewer irregularities figures hold only for organizations that actually run the orchestration.
A Medium post by Sami Marreed credits CUGA with benchmark rankings on AppWorld and WebArena, framing the results as what the architecture enables. The specific rankings are the author's representation; independent verification against current public leaderboards is not available.
What to watch next. The IBM Research publication page for the paper is the right place to track ACM CAIS 2026 status and any production deployments. The interesting question is not whether generalist agents can be made governable. The IBM demo shows they can. The interesting question is which organizations can do the policy authoring, the human review, and the inventory work fast enough to keep up with the agents they are already shipping.