When you throw multiple AI agents at a research problem, you get a choice: let them search in parallel and pick the best result, or have them hand off to specialized experts before touching the codebase. Yang Shen and colleagues at four universities tested both approaches on automated machine learning optimization and found that neither wins universally. The architecture that works depends on how much time you have and how hard the problem is — and the failure modes of each are instructively different, as their paper, posted to arXiv on March 31, makes clear.
The researchers built a testbed they call MAAR (Multi-Agent Automated Research) using Git worktree isolation, so each agent or team works on a clean copy of the codebase without trampling on others. They then ran two distinct multi-agent architectures against the same autoresearch task — optimizing a neural network's validation loss under fixed time budgets.
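The worktree-isolation pattern itself is plain Git. A minimal sketch (repo contents, branch names, and paths here are illustrative, not taken from MAAR):

```shell
#!/bin/sh
set -e

# Scratch repo standing in for the shared codebase.
base=$(mktemp -d)
git init -q "$base/repo"
cd "$base/repo"
git config user.email "agent@example.com"
git config user.name "agent"
echo "print('baseline')" > train.py
git add train.py
git commit -qm "baseline"

# One worktree per agent, each on its own branch: edits in one
# checkout never touch the others, so merging stays an explicit step.
for agent in agent-1 agent-2 agent-3; do
  git worktree add -q -b "$agent" "$base/$agent"
done

git worktree list   # three isolated checkouts plus the main one
```

Each agent gets a full working copy backed by the same object store, which is why a coordinator can later cherry-pick or merge only the patches it likes.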
The subagent approach uses parallel workers that explore independently, then a coordinator merges the most promising patches. Think of it as breadth-first search with a central arbiter. Under a 300-second budget, this mode produced seven effective improvements across 50 rounds. It is fast and resilient: workers run concurrently, and if one fails, the others keep going. But the researchers observed a consistent failure pattern. The coordinator kept exploiting a single hyperparameter — the MLP expansion ratio — instead of exploring other architectural dimensions. Sequentially scaling it from 4× down to 0.75× across multiple rounds is not a sign of intelligence. It is a greedy trap. The system found one thing that works and kept squeezing it.
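The breadth-first shape of this mode fits in a few lines. A sketch under loose assumptions — the candidate space and scoring function below are stand-ins, not MAAR's actual evaluation, which runs real training:

```python
import random
from concurrent.futures import ThreadPoolExecutor

def propose_and_score(seed: int) -> tuple[dict, float]:
    """One subagent: independently pick a candidate patch and score it.
    The 'loss' here is a toy objective standing in for validation loss."""
    rng = random.Random(seed)
    patch = {"mlp_ratio": rng.choice([0.75, 1.0, 2.0, 4.0]),
             "lr": rng.choice([1e-4, 3e-4, 1e-3])}
    loss = abs(patch["mlp_ratio"] - 1.0) + abs(patch["lr"] - 3e-4) * 1000
    return patch, loss

def coordinator_round(n_workers: int = 4) -> dict:
    """Run workers in parallel, then greedily keep the best patch.
    This is exactly the greedy step that invites the exploitation trap:
    whatever dimension scored well last round gets squeezed again."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        results = list(pool.map(propose_and_score, range(n_workers)))
    best_patch, _ = min(results, key=lambda r: r[1])
    return best_patch

print(coordinator_round())
```

Because no worker sees the others' patches, the only cross-dimensional reasoning happens in the coordinator's one-line `min` — which is the structural reason the system converges on a single hyperparameter.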
The agent team approach works differently. Three fixed-role experts — an Architecture expert, an Optimizer and Schedule expert, and an Efficiency and Memory expert — communicate before any code is written. The hand-off happens pre-execution, not post-hoc. Under the same 300-second budget, this mode produced only three effective improvements. It is slower and more fragile: the multi-author code generation introduces conflicts that can break the build. But when it works, the changes are qualitatively deeper. The researchers observed coupled modifications — attention patterns, learning rates, vocabulary size — that a parallel system would not discover because no single worker sees the whole picture.
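The pre-execution handoff can be sketched the same way. The expert functions and their specific choices below are hypothetical illustrations of coupled decision-making, not the paper's prompts:

```python
def architecture_expert(plan: dict) -> dict:
    # Proposes a structural change first; later experts see it.
    plan["attention"] = "sliding-window"
    return plan

def optimizer_expert(plan: dict) -> dict:
    # Couples its choice to the architecture decision already in the plan.
    plan["lr"] = 1e-3 if plan.get("attention") == "sliding-window" else 3e-4
    return plan

def efficiency_expert(plan: dict) -> dict:
    # Trades memory against the decisions above.
    plan["vocab_size"] = 32_000
    return plan

def team_round() -> dict:
    """Experts hand off a shared plan before any code is written;
    the result is one coupled change set, applied in a single commit."""
    plan: dict = {}
    for expert in (architecture_expert, optimizer_expert, efficiency_expert):
        plan = expert(plan)
    return plan

print(team_round())
```

The fragility the researchers observed lives in the step this sketch elides: turning one shared plan into code written by multiple authors, where conflicting edits can break the build.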
The fundamental tension is operational stability versus theoretical deliberation. Parallel subagents are high-throughput but shallow. Expert handoffs are low-throughput but deep. Neither is universally better.
This finding maps onto what practitioners already knew anecdotally. AWS's DevOps Agent, which became generally available on March 31, is optimized for fast incident investigation — a shallow, time-constrained task where parallel exploration makes sense. Meanwhile, systems designed for complex architectural refactoring tend to involve more deliberate, expert-driven handoffs — slower, but better suited to problems where coupling matters. The distinction shows up in production behavior: agents that can investigate often cannot act, and agents built for deep refactoring often cannot stay stable under time pressure.
What Shen et al. argue is that the field should stop choosing a coordination architecture at project start and instead route tasks dynamically. Simple, shallow improvements go to parallel subagents. Complex, coupled changes go to expert teams. The routing decision is made at runtime based on task complexity, not human intuition at kickoff.
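A runtime router of this kind reduces to a small decision function. The features and thresholds below are illustrative assumptions, not values from the paper — a real router would estimate them from the task and budget:

```python
def route(task: dict) -> str:
    """Pick a coordination mode at runtime from simple task features.
    'dimensions_touched' approximates coupling; 'budget_s' is the
    wall-clock budget in seconds. Thresholds are illustrative."""
    coupled = task.get("dimensions_touched", 1) > 1
    tight_budget = task.get("budget_s", 300) <= 300
    if coupled and not tight_budget:
        return "expert_team"       # deep, coupled change: pay for deliberation
    return "parallel_subagents"    # shallow or time-constrained: go wide

print(route({"dimensions_touched": 3, "budget_s": 3600}))  # expert_team
print(route({"dimensions_touched": 1, "budget_s": 300}))   # parallel_subagents
```

The point is not these particular thresholds but where the decision lives: per task at runtime, rather than once at project kickoff.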
The researchers tested this intuition on FARS (Fully Automated Research System), an end-to-end AI research pipeline from AI Scientist developer Analemma AI. FARS ran for 417 hours, consumed 21.6 billion tokens, and spent $186,000 to generate 166 papers. That compute footprint is extreme — roughly a quarter's worth of cloud-infrastructure spending for a mid-size research lab — but it establishes that the coordination question is not theoretical. Real systems are running these experiments at real cost, and the coordination architecture determines whether that spend produces breadth or depth.
There are limits to what a single benchmark reveals. The testbed optimizes neural network hyperparameters on a fixed codebase — a well-defined search problem with measurable progress. Real automated research is messier: the goalposts move, the evaluation criteria are ambiguous, and the coupling between changes is often unknown until after the fact. Whether the subagent greediness problem persists in open-ended research domains is an open question. The authors acknowledge this and frame their work as establishing empirical baselines, not solving the coordination problem.
The implication for builders is concrete. If you are building an autoresearch system and choosing between parallel subagents and expert teams, the answer is not architectural — it is economic. Parallel subagents are cheap per-round and good for things that respond to greedy optimization. Expert teams are expensive per-round and good for things that require coordinated changes across dimensions. The smart move is to build the router, not pick a side.