Why Running More AI Search Agents in Parallel Stops Helping
When teams of AI search agents all open with nearly identical questions, they retrieve the same evidence and the rest of their reasoning collapses onto a shared foundation.
When teams of AI research agents tackle hard questions together, the instinct is to add more workers. Spin up a dozen search agents in parallel, let each one fetch and read, and pool the results. Brute force should win. In practice the returns flatten out almost immediately, and a new paper on diverse query initialization says the reason is not compute. It is sameness.
Consider a multi-hop question: which two scientists, both born in the 1800s, independently arrived at the same idea about how species change over time? A single agent has to retrieve a name, identify the year, link it to a second figure, and stitch the two together. When ten agents start in parallel, the failure is not that they are weak. It is that they all begin by asking nearly the same first question, retrieve the same Wikipedia paragraph, and then race down parallel hallways that have already been paved with the same evidence. The paper calls this query redundancy at the first turn, and it poisons the rest of the trajectory no matter how much compute the team throws at it.
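One rough way to see this kind of first-turn redundancy is to compare what each agent retrieved. The sketch below is an illustration only, not the paper's metric: it assumes each agent's first retrieval is available as a set of document IDs (the IDs shown are hypothetical) and reports the mean pairwise Jaccard overlap, where 1.0 means every agent fetched exactly the same evidence.

```python
from itertools import combinations


def evidence_overlap(retrieved):
    """Mean pairwise Jaccard overlap between agents' first-turn document sets.

    `retrieved` is a list of sets, one per agent. This is an illustrative
    measure chosen for this sketch, not the metric used in the paper.
    """
    pairs = list(combinations(retrieved, 2))
    if not pairs:
        return 0.0
    total = 0.0
    for a, b in pairs:
        union = a | b
        # Two empty result sets are treated as having no overlap.
        total += len(a & b) / len(union) if union else 0.0
    return total / len(pairs)


# Hypothetical document IDs, for illustration only.
redundant = [
    {"darwin_bio", "evolution"},
    {"darwin_bio", "evolution"},
    {"darwin_bio", "evolution"},
]
diverse = [
    {"darwin_bio", "evolution"},
    {"wallace_bio", "ternate_letter"},
    {"linnean_1858", "evolution"},
]

print(evidence_overlap(redundant))
print(evidence_overlap(diverse))
```

A score near 1.0 on the first turn suggests that extra agents are mostly re-reading the same material rather than widening the search.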
DivInit, a training-free technique introduced in "Beyond Parallel Sampling: Diverse Query Initialization for Agentic Search," attacks the problem at the root. Before launching any parallel agents, the system makes a single model call that returns n candidate opening questions. From those n, it picks the k candidates that are most dissimilar to one another, and fires only those k as parallel trajectories. The intervention is small and model-agnostic, and it requires no training. It is the kind of knob a builder can wire in without retraining a model.
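The select-k-of-n step can be sketched in a few lines. What follows is a minimal illustration under stated assumptions, not the authors' implementation. The article does not say how the paper measures difference between queries, so this sketch substitutes token-level Jaccard distance and a greedy max-min (farthest-point) selection. The function names `distance` and `select_diverse` are hypothetical, and the candidate list stands in for the output of the single model call.

```python
import re


def tokens(query):
    """Lowercased word tokens of a query, as a set."""
    return set(re.findall(r"[a-z0-9]+", query.lower()))


def distance(a, b):
    """Jaccard distance between two queries (assumed measure, not the paper's)."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    if not union:
        return 0.0
    return 1.0 - len(ta & tb) / len(union)


def select_diverse(candidates, k):
    """Greedily pick k mutually dissimilar queries from n candidates.

    Seeds with the candidate farthest from all others in total, then
    repeatedly adds the candidate whose nearest already-chosen query
    is farthest away (max-min selection).
    """
    if k <= 0 or not candidates:
        return []
    n = len(candidates)
    k = min(k, n)
    seed = max(
        range(n),
        key=lambda i: sum(distance(candidates[i], c) for c in candidates),
    )
    chosen = [seed]
    while len(chosen) < k:
        remaining = [i for i in range(n) if i not in chosen]
        nxt = max(
            remaining,
            key=lambda i: min(distance(candidates[i], candidates[j]) for j in chosen),
        )
        chosen.append(nxt)
    return [candidates[i] for i in chosen]


# Hypothetical candidates standing in for the single model call's output.
candidates = [
    "who proposed natural selection",
    "who proposed natural selection first",
    "when was alfred wallace born",
]
for query in select_diverse(candidates, k=2):
    print(query)
```

With these candidates, the two near-duplicate questions about natural selection are never chosen together, which is the behavior the technique is after. A production version would more plausibly compare embeddings than raw tokens; that choice is left open here because the article does not specify it.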
The result, measured on multi-hop question answering, is a 5-to-7 point average gain over standard parallel sampling at matched compute. The authors tested five open-weight models across eight benchmarks, and the gain held up. That is enough to be interesting to anyone running an agentic search pipeline today, and it points at a different mental model for how to scale. When breadth is the constraint, the highest-leverage move is not another worker. It is a more varied set of opening questions.
Two scope notes matter. The paper measures breadth scaling, meaning more parallel agents, not more turns or tokens per agent. Depth scaling is explicitly out of scope. The closed-weight frontier models, the agents that ship inside consumer products, and harder agent benchmarks that mix tools, code, and long-horizon planning are also not tested. The honest read is that DivInit is a clean, reproducible win inside the regime the authors actually measured, and a strong hypothesis for the regime they did not.
For teams that already run parallel agentic search, the takeaway is concrete: the first query is the highest-leverage place to inject diversity, and the fix is a single call. The code is public on GitHub, and the trick is small enough to retrofit. For everyone watching the agentic search race, the open question is whether the same redundancy pattern shows up once depth, tools, and longer planning enter the picture.