Every AI agent deployed today was trained to get the right answer. None were trained to avoid unnecessary work. A new study puts a number on the cost: up to 14.5 percent accuracy loss on questions the model could have answered without calling a tool.
The finding, from researchers at Beijing University of Posts and Telecommunications and Nankai University in a paper accepted to ICML 2026, traces the problem to how these models learn. The dominant training method for tool-using AI, reinforcement learning with verifiable rewards (RLVR), rewards only whether the final answer was correct, never whether the path to get there involved unnecessary work. Call a calculator for 2+2, get the right answer, and the model learns calculators are good for 2+2. It does not learn it did not need one. The training signal never says: you spent effort on something you did not need to do.
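The incentive gap is easy to make concrete. The toy reward function and trajectory format below are illustrative assumptions, not the paper's implementation; they show only that a correctness-only signal cannot distinguish a direct answer from a detour through tools:

```python
def rlvr_reward(final_answer: str, gold_answer: str) -> float:
    """Correctness-only reward in the RLVR style: 1 if the final answer
    verifies against the gold answer, else 0. Nothing in the signal
    depends on how the answer was produced."""
    return 1.0 if final_answer == gold_answer else 0.0

# Two hypothetical trajectories for "What is 2 + 2?":
direct = {"tool_calls": 0, "answer": "4"}    # answered from parametric knowledge
detoured = {"tool_calls": 3, "answer": "4"}  # routed through a calculator tool three times

# Both trajectories receive identical reward, so the extra tool calls
# are never discouraged during training.
assert rlvr_reward(direct["answer"], "4") == rlvr_reward(detoured["answer"], "4") == 1.0
```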
The mechanism is what the paper calls the knowledge illusion: models hallucinate their own knowledge boundaries, calling external tools even for problems they could solve internally. On average, models invoke tools 0.93 times per query even when no external assistance is needed. By the end of RLVR training, models show a 65 percent increase in tool-call turns relative to their base versions. Frontier models achieve only 80.2 percent accuracy at detecting when a tool is genuinely irrelevant, so nearly one in five irrelevant tools goes unrecognized and gets called anyway. Among open-source models, the Llama series is hit hardest: invoking tools on simple internal-knowledge problems incurs a 49.15 percent performance penalty. The tool actively damages the answer.
The practical consequence matches what practitioners are seeing in production. Serhii P, an AI agent developer, described the pattern in a widely shared post on DEV Community eight days ago: engineers building elaborate machinery around the model, custom orchestration layers, hand-rolled retry logic, and massive tool-routing systems, all to solve problems the LLM could already solve on its own. The routing infrastructure is not wrong. But it treats a symptom while the cause, a training signal that never encoded strategy costs, remains embedded in every model fine-tuned with RLVR.
The paper proposes two fixes. Knowledge-aware direct preference optimization teaches models to distinguish between questions they can answer internally and questions that genuinely require external information. Applied to a 32-billion-parameter model, it reduced unnecessary tool calls by 82.8 percent while improving accuracy by 3 percent. A second approach using balanced reward signals, penalizing unnecessary tool calls regardless of whether the final answer was correct, cut unnecessary calls by 60 to 67 percent without accuracy loss.
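Both fixes can be sketched at the level of their training signals. The function names, data format, and penalty weight below are assumptions for illustration, not the paper's exact formulation:

```python
def dpo_pair(question: str, answerable_internally: bool,
             direct_traj: dict, tool_traj: dict) -> dict:
    """Knowledge-aware preference pair construction: for questions the
    model can answer from parametric knowledge, prefer the tool-free
    trajectory; otherwise prefer the tool-using one."""
    if answerable_internally:
        return {"prompt": question, "chosen": direct_traj, "rejected": tool_traj}
    return {"prompt": question, "chosen": tool_traj, "rejected": direct_traj}

def balanced_reward(correct: bool, unnecessary_calls: int,
                    penalty: float = 0.2) -> float:
    """Balanced reward: correctness reward minus a per-call penalty for
    unnecessary tool use, applied whether or not the answer is right."""
    return (1.0 if correct else 0.0) - penalty * unnecessary_calls

# A correct answer reached through two spurious tool calls now scores
# lower than the same answer produced directly.
assert balanced_reward(True, 2) < balanced_reward(True, 0)
```

The design point both sketches share: the training signal finally sees the path, not just the destination.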
Both fixes require retraining from scratch. Neither has been independently tested outside the authors' benchmarks. The experiments run on math reasoning tasks, and whether the findings generalize to real-world agent systems that combine code execution, web search, and API calls in unpredictable sequences is an open question. Retraining is expensive and disruptive for teams that have already shipped agent products.
What is not an open question is where the failure originates. Every agent deployed today that was fine-tuned with RLVR runs on a training signal that never encoded strategy costs. Better routing infrastructure cannot fix it. The gap is structural.