Your company is evaluating a multi-agent AI system. The benchmark results look excellent: high task completion rates, smooth coordination between agents, accurate outputs across the test suite. Before you sign the contract, you should know what the benchmarks are not measuring: whether any of this holds up when the system encounters a domain it was not trained on.
That gap between benchmark performance and real-world generalization is the finding at the center of a new empirical study from Namyoung So and colleagues, posted to arXiv. The study tested three multi-agent frameworks (ChatDev, GPTSwarm, and AgentVerse) across tasks designed to simulate domain shift: situations where the system encounters inputs or contexts outside its training distribution. The results are a warning for anyone betting enterprise infrastructure on current multi-agent architectures.
On standard benchmarks, all three systems performed well. But when the researchers introduced domain-shifted conditions (the agent equivalent of handing a radiologist X-rays from an unfamiliar machine without warning), the systems continued reporting high confidence even as the actual coordination between agents fell apart. The researchers call this "illusory coordination": the visible outputs look fine, but the internal reasoning chain that produced them has drifted from what the task actually required. Surface accuracy masks internal failure.
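The failure mode above suggests a simple check: compare what the agents claim (confidence) against what they actually do (agreement between their intermediate outputs). The sketch below is a hypothetical illustration of that idea, not the paper's method; the function names, the exact-match agreement measure, and the thresholds are all assumptions. A real harness would use a semantic-similarity measure rather than string equality.

```python
# Hypothetical detector for "illusory coordination": flag runs where
# agents' self-reported confidence stays high while their intermediate
# outputs have stopped agreeing. Names and thresholds are illustrative.

def pairwise_agreement(outputs):
    """Fraction of agent-output pairs that match exactly.
    A stand-in for a real semantic-similarity measure."""
    pairs = [(a, b) for i, a in enumerate(outputs) for b in outputs[i + 1:]]
    if not pairs:
        return 1.0
    return sum(a == b for a, b in pairs) / len(pairs)

def illusory_coordination(confidences, outputs,
                          conf_floor=0.8, agree_floor=0.5):
    """True when mean confidence is high but actual agreement
    between agents has collapsed."""
    mean_conf = sum(confidences) / len(confidences)
    return mean_conf >= conf_floor and pairwise_agreement(outputs) < agree_floor

# In-domain run: high confidence, agents agree -> not flagged.
print(illusory_coordination([0.90, 0.92, 0.88], ["A", "A", "A"]))  # False
# Domain-shifted run: confidence still high, outputs diverge -> flagged.
print(illusory_coordination([0.91, 0.89, 0.90], ["A", "B", "C"]))  # True
```

The point of the sketch is that neither signal alone is diagnostic: high confidence is normal when the system works, and low agreement with low confidence is an honest failure. Only the combination is "illusory."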
The finding aligns with independent research from DeepMind's Science of Scaling project, which in 2025 documented error amplification of up to 17.2 times in unstructured agent configurations: what researchers call a "bag of agents," with no governing topology to coordinate how individual agents share context or resolve conflicts. Structured topologies, where the information flow between agents follows defined rules, showed significantly lower error rates under the same conditions.
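Why an unstructured topology amplifies errors can be seen with a toy Monte Carlo model: once any agent corrupts the shared context, every downstream agent consumes the corruption, whereas a structured topology inserts hops that can catch and repair it. This simulation and its parameters are illustrative assumptions, not the DeepMind study's setup, and the resulting ratio is not meant to reproduce the 17.2x figure.

```python
# Toy model of error propagation through a chain of agents.
# structured=True means each downstream hop has a chance to catch and
# repair a corrupted intermediate result; structured=False ("bag of
# agents") means an error, once introduced, persists. All parameters
# are illustrative.
import random

def failure_rate(structured, n_agents=6, base_error=0.02,
                 repair_prob=0.6, trials=50_000, seed=1):
    rng = random.Random(seed)
    failures = 0
    for _ in range(trials):
        corrupted = False
        for _ in range(n_agents):
            if rng.random() < base_error:
                corrupted = True  # this agent introduces an error
            elif corrupted and structured and rng.random() < repair_prob:
                corrupted = False  # a structured hop repairs it
        failures += corrupted
    return failures / trials

bag = failure_rate(structured=False)
pipeline = failure_rate(structured=True)
print(f"bag of agents: {bag:.3f}, structured: {pipeline:.3f}, "
      f"ratio: {bag / pipeline:.1f}x")
```

Even in this crude model, the end-to-end failure rate of the unstructured chain is a multiple of the structured one, because the structured topology gives errors repeated chances to be intercepted before they reach the final output.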
What makes the illusory coordination finding uncomfortable for enterprise buyers is that it is not visible in standard benchmark reporting. A vendor demo or an internal pilot will show high completion rates, because the demo environment is itself a training-domain analog. The failure only appears when the system meets a genuinely new context. By then, the contract is signed.
The paper's scope is limited to three academic multi-agent frameworks, not production enterprise systems. Whether systems like Salesforce Agentforce, Microsoft Copilot Agents, or other commercial platforms exhibit the same failure mode is an open question the paper does not address. The authors tested what existed in the open research ecosystem; the commercial landscape is a different and less transparent one.
Academic researchers have begun proposing detection methods: frameworks for stress-testing multi-agent systems against domain-shifted inputs before deployment. But these are not yet standard practice in enterprise procurement. The gap between what the research community has documented and what enterprise evaluation processes actually measure remains substantial, as Towards Data Science reported in covering the bag-of-agents failure mode.
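What such a pre-deployment stress test might look like, in its simplest form, is scoring the system on an in-domain suite and a domain-shifted suite and reporting the gap rather than a single completion rate. The sketch below is a minimal illustration under assumed names; the `system` callable, task format, and the 10-point gap threshold are all hypothetical.

```python
# Minimal sketch of a cross-domain evaluation gate: report the gap
# between in-domain and domain-shifted completion rates instead of a
# single benchmark number. Task format and threshold are assumptions.

def completion_rate(system, tasks):
    return sum(system(t["input"]) == t["expected"] for t in tasks) / len(tasks)

def shift_report(system, in_domain, shifted, max_gap=0.10):
    in_rate = completion_rate(system, in_domain)
    out_rate = completion_rate(system, shifted)
    gap = in_rate - out_rate
    return {"in_domain": in_rate, "shifted": out_rate,
            "gap": gap, "pass": gap <= max_gap}

# Toy "system" that memorized its training domain and guesses elsewhere.
memorized = {"2+2": "4", "3+3": "6"}
system = lambda x: memorized.get(x, "4")

in_domain = [{"input": "2+2", "expected": "4"},
             {"input": "3+3", "expected": "6"}]
shifted = [{"input": "5+5", "expected": "10"},
           {"input": "7+7", "expected": "14"}]

print(shift_report(system, in_domain, shifted))
# Perfect in-domain score, zero on the shifted suite: the gap, not the
# headline completion rate, is what fails the gate.
```

This is exactly the number a vendor demo does not surface: the toy system above would score 100% on any benchmark drawn from its training-domain analog.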
Whether enterprise buyers are already using cross-domain evaluation and are aware of these failure rates is the question that would most directly affect the stakes of this research. It is also the one the current evidence cannot answer. The next time your team reviews a multi-agent system benchmark, the relevant test may be the one that was not in the benchmark.