When Agents Disagree: The Selection Bottleneck in Multi-Agent LLM Pipelines
Adding a weaker model to your agent team might make it both smarter and cheaper. That is the strangest finding in a March 20 preprint by independent researcher Artem Maryanskyy, and anyone building multi-agent LLM pipelines will want to read it carefully.
The paper, titled "When Agents Disagree: The Selection Bottleneck in Multi-Agent LLM Pipelines" and posted to arXiv, runs 42 tasks across seven categories with 210 total evaluations to answer a question the field has gotten wrong: does model diversity in a team of agents actually help? The short answer, according to this research: it depends. Specifically, it depends on how good your selection mechanism is. In the experimental conditions tested, that quality determines whether diversity helps or hurts performance.
The numbers are stark. A diverse team of three frontier models using judge-based selection achieved a 0.810 win rate against a single strong baseline, according to the paper (arXiv:2603.20324). The same diverse team using synthesis-based aggregation achieved a 0.179 win rate, losing to the single model in every one of the 42 tasks. A majority vote approach hit 0.496, which is essentially chance.
The Crossover Threshold
Maryanskyy resolves a genuine contradiction in the multi-agent literature. A 2024 paper by Wang et al. showed heterogeneous mixture-of-agents beating single models (arXiv:2406.04692v1). A 2025 paper by Li et al. showed the opposite under controlled conditions (arXiv:2502.00674). Both were right about different sides of a crossover governed by what Maryanskyy calls the selection quality, denoted s, scaled from 0 to 1, where 0 is random and 1 is a perfect oracle.
The crossover threshold, a critical value of s defined formally in Proposition 1 of the paper, marks the point below which diversity hurts and above which it helps. Synthesis-based aggregation operates near s=0, essentially outputting a team mean. In that regime, a diverse team's outputs average toward the middle, destroying the variance that makes diversity valuable in the first place. Judge-based selection, by contrast, operates well above the threshold and can actually extract the upside.
The implication, as Maryanskyy frames it, is that for agentic pipelines the lever is the selection mechanism, not generator diversity. That is a clean inversion of how most teams think about the problem.
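The crossover idea can be made concrete with a toy model. The sketch below is not the paper's Proposition 1. It assumes, purely for illustration, that a selector of quality s delivers a linear blend of the team mean (s=0) and the best candidate (s=1). The function names, the scores, and the baseline are all hypothetical.

```python
def expected_quality(scores, s):
    """Toy assumption: interpolate linearly between the team mean (s=0)
    and the oracle-best candidate (s=1). Not the paper's formula."""
    team_mean = sum(scores) / len(scores)
    oracle = max(scores)
    return (1 - s) * team_mean + s * oracle


def crossover(scores, baseline):
    """Selection quality at which the toy team matches a single baseline.
    Returns None when the candidates have no spread to exploit."""
    team_mean = sum(scores) / len(scores)
    oracle = max(scores)
    if oracle == team_mean:
        return None
    return (baseline - team_mean) / (oracle - team_mean)


# Hypothetical candidate qualities for one diverse team, and a single
# strong model to compare against. None of these come from the paper.
team = [0.75, 0.5, 0.25]
single_model = 0.625

threshold = crossover(team, single_model)
for s in (0.0, 0.25, 0.75, 1.0):
    verdict = "helps" if expected_quality(team, s) > single_model else "hurts"
    print(f"s={s:.2f}: team={expected_quality(team, s):.4f} -> diversity {verdict}")
print("toy crossover threshold:", threshold)
```

Under these assumptions the team loses to the single model below the threshold and beats it above, which is the qualitative shape the paper describes. The real relationship between s and output quality need not be linear.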
The Weak-Model Paradox
The most counterintuitive result is an exploratory finding involving Claude Haiku, Anthropic's budget model. Adding Haiku to a diverse team alongside Claude Opus and Gemini 2.5 Pro was associated with higher win rates and lower inference cost. The mechanism: a weaker model from a different capability tier introduces orthogonal error patterns that raise the team's oracle ceiling even as they lower the team mean. A strong selector captures the upside and ignores the downside.
This finding was not pre-registered as a primary hypothesis and should be treated as an observation rather than a design prescription. But it inverts conventional wisdom about team composition. You may not need to pay for three frontier models.
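A small worked example shows how a team's oracle ceiling can rise while its mean falls. The per-task scores below are invented for illustration and are not data from the paper. The model names and the helpers `team_mean` and `oracle_ceiling` are likewise hypothetical.

```python
from statistics import mean

# Hypothetical per-task quality scores (one entry per task).
# The two strong models fail on the same task (index 2);
# the weak model is worse on average but strong exactly there.
strong_a = [0.9, 0.9, 0.3, 0.8]
strong_b = [0.8, 0.9, 0.4, 0.9]
weak_c = [0.5, 0.4, 0.7, 0.5]


def team_mean(team):
    """Average quality if outputs are blended or picked at random."""
    return mean(mean(task_scores) for task_scores in zip(*team))


def oracle_ceiling(team):
    """Average quality if a perfect selector picks the best output per task."""
    return mean(max(task_scores) for task_scores in zip(*team))


two_strong = [strong_a, strong_b]
with_weak = [strong_a, strong_b, weak_c]

print("team mean:     ", round(team_mean(two_strong), 4), "->", round(team_mean(with_weak), 4))
print("oracle ceiling:", round(oracle_ceiling(two_strong), 4), "->", round(oracle_ceiling(with_weak), 4))
```

In this made-up table the mean drops and the ceiling rises when the weak model joins, because its one good task is the task both strong models miss. Whether a real selector captures that gain depends on how close it gets to oracle behavior.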
The judge panel itself was carefully separated from the agent pool. Anthropic's Claude Sonnet served as judge alongside GPT-5-mini and DeepSeek-V3p2, with strict zero overlap between judge and generator roles. The paper acknowledges the same-family concern: Sonnet and Opus are both Anthropic models. A decoupled evaluation pass using Gemini 2.0 Flash and GLM-5 as independent judges confirmed the directional findings, with a Spearman correlation of 0.90 across all conditions, though effect sizes were attenuated by 53 to 67 percent.
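For readers unfamiliar with the statistic, Spearman correlation compares how two judges rank the same conditions, not how far apart their scores are. The sketch below uses invented win rates for five hypothetical conditions, chosen so the two panels disagree on one adjacent pair. It uses the textbook formula for data without ties and does not reproduce the paper's analysis.

```python
def ranks(values):
    """Rank values from 1 (smallest) to n (largest). Assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(xs, ys):
    """Spearman rank correlation for tie-free data:
    rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    n = len(xs)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


# Hypothetical win rates per condition from two judge panels.
# Panel B's scores are compressed toward 0.5 but mostly keep the ordering.
panel_a = [0.81, 0.50, 0.18, 0.62, 0.40]
panel_b = [0.64, 0.49, 0.36, 0.55, 0.51]

print("Spearman rho:", round(spearman(panel_a, panel_b), 2))
```

The example also illustrates why high rank agreement and attenuated effect sizes can coexist: the second panel's scores sit much closer together, yet the ordering of conditions is nearly the same.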
Why Majority Vote Fails for Open-Ended Generation
The majority vote result, sitting at 0.496 across 42 tasks, is worth dwelling on because the intuition behind it is reasonable. The Condorcet Jury Theorem predicts that majority voting improves accuracy as team size grows, provided there is a well-defined correctness criterion and each voter is independently more likely than not to be right. For code execution or math problems, that condition holds. For open-ended generation, it does not. When "correct" is not well-defined, voting collapses to what amounts to a coin flip.
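The theorem's arithmetic is easy to check directly. The sketch below computes the probability that a majority of n independent voters is right when each is right with probability p. It illustrates the theorem only; the function name and the chosen values of p are assumptions for the example, not figures from the paper.

```python
from math import comb


def majority_correct(p, n):
    """Probability that more than half of n independent voters are correct,
    when each is correct with probability p. Assumes n is odd (no ties)."""
    needed = n // 2 + 1
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(needed, n + 1))


for p in (0.6, 0.5):
    row = ", ".join(f"n={n}: {majority_correct(p, n):.3f}" for n in (1, 3, 5, 11))
    print(f"p={p}: {row}")
```

With p above 0.5 the majority's accuracy climbs as voters are added. With p at exactly 0.5, which is one way to picture voters who share no usable notion of "correct", the majority stays at 0.5 no matter how large the team gets.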
LLM-Blender (GitHub: avnlp/llm-blender), a related system for selecting among LLM outputs, frames this as a ranker problem rather than a synthesis problem. The distinction matters: blending throws away the variance that diversity creates, while choosing does not.
The Irony of Agents Selecting Among Agents
There is something quietly funny about a preprint that establishes, with controlled experiments and a closed-form proof, that the bottleneck in AI agent pipelines is the mechanism by which AI agents choose among AI agents' outputs. The whole stack is bootstrapped: judges evaluating judges, selection quality measured against ground truth that other models produced. The paper does not linger on this, but it is not wrong to notice it.
No code repository has been published alongside the preprint yet. The paper is 12 pages with three figures and five tables, submitted on March 20, 2026, with no press coverage prior to this story. It has not been peer reviewed.