The Automation-Surprise Problem Has a New Name: Multi-Agent AI
The automation-surprise problem, in which opacity plus autonomy produces accidents, has found a new domain in which to repeat itself.
Multi-agent AI systems, where multiple AI agents coordinate to plan and execute tasks without humans in the loop for every step, are arriving in production before anyone has mapped where human oversight should live. A paper published this week from Megagon Labs and Penn State proposes the most systematic answer so far: a taxonomy with three axes that maps the full design space of how humans can stay involved when AI agents do the planning. The paper has been accepted at CAIS 2026, which begins in San Jose on May 26.
The framework, called AMBIPOM, was developed by Zeyu He, Hannah Kim, Dan Zhang, and Estevam Hruschka at Megagon Labs, building on their earlier AIPOM work from EMNLP 2025. It organizes human-LLM co-planning around three decisions. Mode is whether a human and an AI agent compare plans semantically, by meaning, or structurally, by step-by-step layout. Scope is whether a human adjusts the whole plan at once or individual steps within it. Level is whether the human works at a high level, reshaping the plan's logic, or at a low level, changing specific actions.
The difference shows up in practice. Consider a multi-agent system that automates software deployment: an orchestrator agent writes a deployment plan, and a human needs to catch a mistake before it runs. Under mode=semantic, scope=global, level=high, the human sees only "this plan deploys to production" and must infer from that description whether something is wrong. Under mode=structural, scope=targeted, level=low, the human sees the exact sequence of commands (mkdir, chmod, rsync, restart) and can flag a command that should not be there. The same error is caught in two different ways, each requiring a different level of trust in the AI's self-description.
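To make the two review configurations concrete, here is a minimal Python sketch of the three axes and of what a reviewer would see under each setting. It is an illustration, not the AMBIPOM implementation: the class and function names (Mode, Scope, Level, ReviewConfig, Plan, render_for_review, all_configs) are hypothetical, and it assumes each axis takes only the two values named in this article.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum
from itertools import product


# Hypothetical names throughout; each axis is assumed to be binary,
# following the article's description rather than the paper's code.
class Mode(Enum):
    SEMANTIC = "semantic"      # compare plans by meaning
    STRUCTURAL = "structural"  # compare plans by step-by-step layout


class Scope(Enum):
    GLOBAL = "global"      # adjust the whole plan at once
    TARGETED = "targeted"  # adjust individual steps


class Level(Enum):
    HIGH = "high"  # reshape the plan's logic
    LOW = "low"    # change specific actions


@dataclass(frozen=True)
class ReviewConfig:
    mode: Mode
    scope: Scope
    level: Level


@dataclass(frozen=True)
class Plan:
    summary: str
    steps: tuple[str, ...]


def render_for_review(plan: Plan, config: ReviewConfig) -> list[str]:
    """Return the lines a reviewer sees; only the mode axis changes the view here."""
    if config.mode is Mode.SEMANTIC:
        return [plan.summary]
    return [f"{i}. {step}" for i, step in enumerate(plan.steps, start=1)]


def all_configs() -> list[ReviewConfig]:
    """Enumerate every mode/scope/level combination under the binary-axis assumption."""
    return [ReviewConfig(m, s, l) for m, s, l in product(Mode, Scope, Level)]


# The deployment example from the text, with the plan reduced to command names.
plan = Plan(
    summary="this plan deploys to production",
    steps=("mkdir", "chmod", "rsync", "restart"),
)
coarse = ReviewConfig(Mode.SEMANTIC, Scope.GLOBAL, Level.HIGH)
fine = ReviewConfig(Mode.STRUCTURAL, Scope.TARGETED, Level.LOW)

print(render_for_review(plan, coarse))
print(render_for_review(plan, fine))
print(len(all_configs()), "configurations in this simplified design space")
```

In this sketch only the mode axis changes what is displayed. Scope and level are carried in the configuration because they would govern what the reviewer is allowed to edit, which the sketch leaves out.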
The authors ran a user study with roughly 30 participants across different human-LLM configurations. Participants using semantic mode — reading plans in plain language — caught goal-level errors faster but missed structural mistakes the AI buried in the plan's steps. Participants using structural mode — seeing the full sequence of actions — caught command-level errors but sometimes missed whether the overall goal made sense. The finding: no single mode/scope/level combination dominated. The right choice depended on what kind of error mattered most in a given workflow.
The pressure behind the paper has precedent. Aviation learned the automation-surprise lesson through a string of accidents that are now standard references in human-factors research. Air France Flight 447 crashed in 2009 over the Atlantic, killing 228 people, after the autopilot disconnected at 35,000 feet and the pilots lost awareness of what the aircraft was doing. Two Boeing 737 MAX aircraft crashed, one in 2018 and one in 2019, killing 346 people combined, after an automated flight-control system pushed the nose down based on a single faulty sensor. In each case, automation opacity (pilots who could see outputs but not reasoning) produced situations where human judgment diverged from aircraft state. The regulatory and design response, formalized through NASA/FAA human-factors programs, established what the industry calls "remaining in the loop": pilots need not just the ability to override, but the information needed to override meaningfully.
Multi-agent AI systems face a structurally similar problem. Existing frameworks for building multi-agent systems — AutoGen, CrewAI, LangGraph — give developers orchestration primitives and tool-calling logic, but they do not provide a structured vocabulary for deciding where a human should be in the loop, at what granularity, and with what information. A developer building a multi-agent customer-service pipeline today makes those decisions implicitly, often by defaulting to human-in-the-loop everywhere or nowhere. AMBIPOM's three axes are an attempt to make those design decisions explicit — to give practitioners a map instead of forcing each team to rediscover the tradeoffs through production incidents.
The GitHub repository implementing the framework has zero stars and a single commit — no signal yet that practitioners have found a reason to adopt it. But the taxonomy's value does not depend on uptake: the questions it asks are the questions every multi-agent developer is already answering, mostly without knowing it. The framework gives language to a class of design decisions that are currently made implicitly, in product specs and architecture docs, and that will surface as incidents when those decisions turn out to be wrong. That is why the CAIS 2026 conference next week matters — not because the framework will be adopted, but because the conversation about what it means will start there.