LLM Agents Overreact, Fail to Learn in Group Games
Human groups learned from feedback and stabilized across games. LLM groups did not — reacting at nearly double the human rate to the same error signal, and never once holding a guess constant.

A new preprint from Indiana University provides empirical evidence that frontier LLMs fail to learn in simple coordination games where humans readily improve. The core failure mode is overreactivity: LLM agents adjust by 139% of received error versus 77% for humans, causing oscillation and divergence rather than convergence. Chain-of-thought prompting, often assumed to improve reasoning, worsened performance for Gemini 2.0 Flash, suggesting current scaffolding techniques do not address this fundamental coordination deficit.
A preprint posted to arXiv on April 2 by researchers at Indiana University offers one of the cleaner empirical answers to a question the agent infrastructure world has been arguing about for two years: are LLM-based agents actually good at coordinating with each other? The answer, it turns out, is no — and the mechanism is something engineers already have a name for: overreactivity.
The paper, by Prathamesh Maini, Andrew Robl, Zsolt Kira, Joel Goldstone, and Victoria Tiganj, ran a series of Group Binary Search games with both human participants and four frontier LLMs: Deepseek-V3 (671B parameters), Deepseek-V3.1-T (685B), Llama 3.3 (70B), and Gemini 2.0 Flash. The setup was simple. Groups of two to seventeen players received directional feedback — whether the group's collective guess was too high or too low — and had to converge on a target number in as few rounds as possible.
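The game loop is easy to picture in code. The sketch below is a minimal simulation of the setup as described, not the paper's exact protocol: the aggregation rule (mean of individual guesses) and the halving step size are illustrative assumptions.

```python
# Minimal simulation of a Group Binary Search game: each round, the
# group hears only "too high" or "too low" about its collective guess.
# Aggregation by mean and a halving step are assumptions for illustration.
import random

def play_game(target, n_players, max_rounds=10, low=0, high=100):
    """Return the round on which the group guess lands within 0.5 of
    `target`, or `max_rounds` if it never does."""
    guesses = [random.uniform(low, high) for _ in range(n_players)]
    for rnd in range(1, max_rounds + 1):
        group_guess = sum(guesses) / n_players
        if abs(group_guess - target) < 0.5:
            return rnd
        step = (high - low) / 2 ** rnd  # shrink adjustments over rounds
        direction = -1 if group_guess > target else 1
        guesses = [g + direction * step for g in guesses]
    return max_rounds
```

Seeding the generator (`random.seed(1)`) makes a run reproducible; the returned round count is the per-game score that the learning slopes below are computed over, game by game.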
Humans improved with practice. Under directional feedback, human groups showed a mean learning slope of negative 0.91 rounds per game; 78 percent of human runs showed this improvement pattern. With richer numerical feedback — how far off the guess was — improvement was slower but still consistent. The reaction slope for individual human players averaged negative 0.767, meaning they adjusted by 77 percent of the error they received — a modest underreaction that left room to correct course in subsequent rounds.
LLMs did not improve. Under directional feedback, model-specific learning slopes ranged from negative 0.39 (Deepseek-V3 zero-shot) to positive 0.31 (Deepseek-V3 zero-shot with chain-of-thought). Every bootstrap confidence interval included zero. The paper's phrase for this: no reliable learning. LLM reaction slope averaged negative 1.386 — they adjusted by 139 percent of the error received, overshooting, then overshooting again. Nineteen of 21 numerical-feedback LLM slopes were more negative than the corresponding human slope, indicating consistent overreaction across models and conditions.
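The dynamics behind those slopes fit a one-line error-update model (a toy illustration, not the paper's analysis): if the guess moves by a gain g times the current error each round, the error evolves as e ← (1 − g)·e. A gain of 0.77 shrinks the error smoothly. A gain of 1.39 overshoots, flipping the error's sign every round. And when several agents each apply the full correction to a shared guess, the effective gain compounds past 2 and the error grows without bound.

```python
def error_trajectory(gain, e0=10.0, rounds=6):
    """Error after each round when the guess moves by `gain` * error:
    e_next = (1 - gain) * e."""
    errors, e = [], e0
    for _ in range(rounds):
        e = (1 - gain) * e
        errors.append(round(e, 2))
    return errors

# Human-like underreaction: error shrinks monotonically toward zero.
print(error_trajectory(0.77))
# LLM-like overreaction: error flips sign each round (oscillation).
print(error_trajectory(1.39))
# Two agents each applying the full LLM-scale correction to a shared
# guess (effective gain 2.78): error grows every round (divergence).
print(error_trajectory(2.78))
```

The gain values are the measured reaction slopes taken as magnitudes; the compounding case is a hypothetical worst case for illustration, since the paper does not model how corrections aggregate across agents.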
Gemini 2.0 Flash with chain-of-thought prompting was the most dramatic case. Its mean slope was positive 0.89 rounds per game — it took more rounds to solve the target on each successive attempt. Eighty-nine percent of its runs showed no improvement or active degradation.
The most striking single data point: no LLM agent across any model or condition ever kept the same guess across all rounds of a game. Among human players, 4 percent of small groups (two to three players), 12 percent of medium groups (four to seven), and 19 percent of large groups (ten to seventeen) had at least one member who held their guess constant throughout. Stability increased with group size for humans. It never appeared in any LLM condition.
The researchers did not just document the failure — they found a potential use for it. Mixed human-AI groups showed a complementary pattern: human underreaction meant the group held course without overshooting, while LLM overreaction meant the group adjusted aggressively when it was wrong. LLM overreactivity could complement human caution, they wrote, potentially optimizing collective adjustments in mixed groups.
That framing is speculative. The mixed-group result is a post-hoc observation, not a designed experiment. The authors did not run controlled human-AI team trials. But the structural logic is coherent, and it points toward a design question that agent framework authors have mostly avoided: whether the sweet spot for LLM-based multi-agent systems is fewer pure-agent deployments and more human-in-the-loop architectures.
The group size effect in humans — more stable play in larger groups — also raises questions about how current agent frameworks handle scale. Most demonstrations use two or three agents in a loop. The paper suggests human coordination quality improves with group size, while LLM coordination does not. At scale, that gap likely widens.
There are the usual caveats for a preprint. The sample is modest — 18 experimental games across a range of group sizes, with human participants recruited through university channels. The games are abstract coordination tasks, not real-world agent workflows. Whether these results map to, say, a coding agent reviewing another coding agent's pull request, or a pair of agentic RAG systems negotiating retrieval strategy, is an open empirical question.
What the paper does provide is a baseline. The question of whether frontier LLMs can coordinate like humans — whether they can learn from feedback and stabilize their behavior across repeated interactions — now has a controlled answer. They cannot. The reaction slope is the number worth remembering: humans at negative 0.767, LLMs at negative 1.386. The same error signal makes an LLM jump nearly twice as far as it makes a human. In a multi-agent system, that overreaction compounds.
The practical implication for anyone building agentic pipelines: a system of LLM agents coordinating without human oversight is not a human-equivalent team with human-equivalent learning. It is a system that overreacts to its own errors and does not improve with practice. Whether that is acceptable depends on the task. For high-stakes coordination — multi-agent negotiations, distributed planning, shared reasoning under uncertainty — the Indiana result suggests you want either fewer agents, more human oversight, or both.
Agentics · 4h 8m ago · 3 min read