Your LLM agent can surface the right tool name. It probably can't tell you what the tool actually does.
That is the rough shape of a finding from a team at SAP Labs, which built a diagnostic framework called ToolSense and ran it against five parametric tool-retrieval configurations on the ToolBench catalog of about 47,000 tools. The configurations that post healthy recall scores on the standard, fully-specified ToolBench queries collapse by 50 to 64 percentage points when the query is rewritten to look like something a real user would type. Several of them drop below a standard embedding retriever, the simpler baseline the parametric approach was meant to replace.
The point is not that the models are dumb. The point is that the benchmark, as commonly run, lets them pass without knowing what they are retrieving. The full paper, ToolSense: A Diagnostic Framework for Auditing Parametric Tool Knowledge in LLMs, is on arXiv as preprint 2606.12451.
To make sense of the 50-point collapse, you need a one-paragraph translation of how tool retrieval for LLM agents actually works today. The dominant approach, called parametric tool retrieval and embodied in a 2025 system called ToolGen, treats each tool in a catalog as a virtual token stitched onto the end of the model's vocabulary. Training happens in two stages: first the model memorizes the tool tokens, then a retrieval-style fine-tuning pass teaches it to pick among them. At inference, a structure called a DisjunctiveTrie constrains the decoder's beam search so the output can only be valid sequences of those virtual tokens. That constraint, plus the verbosity of standard test queries, is what lets a model hit above 0.90 recall on the standard ToolBench G1, G2, and G3 splits.
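To see why the trie constraint matters, here is a minimal sketch of the idea. It is not ToolGen's code or its DisjunctiveTrie: the `ToolTrie` class, the token IDs, and the toy scoring function are hypothetical, and greedy search stands in for beam search to keep the example short.

```python
from typing import Callable, Dict, List, Sequence


class ToolTrie:
    """Prefix tree over tool-token sequences (hypothetical helper for illustration)."""

    def __init__(self, sequences: Sequence[Sequence[int]]) -> None:
        self.root: Dict[int, dict] = {}
        for seq in sequences:
            node = self.root
            for tok in seq:
                node = node.setdefault(tok, {})

    def allowed_next(self, prefix: Sequence[int]) -> List[int]:
        """Return the token IDs that keep `prefix` on a valid tool path."""
        node = self.root
        for tok in prefix:
            if tok not in node:
                return []  # prefix already left the catalog
            node = node[tok]
        return sorted(node)


def constrained_greedy(
    trie: ToolTrie,
    score: Callable[[List[int], int], float],
    max_len: int = 8,
) -> List[int]:
    """Greedy decode that only ever considers tokens the trie allows."""
    prefix: List[int] = []
    for _ in range(max_len):
        allowed = trie.allowed_next(prefix)
        if not allowed:  # reached a leaf: a complete tool sequence
            break
        prefix.append(max(allowed, key=lambda t: score(prefix, t)))
    return prefix


# Toy catalog of three tools, each a sequence of made-up virtual token IDs.
trie = ToolTrie([[101, 7], [101, 9], [205]])

# Toy "model": token 999 scores highest but is not a valid tool token.
toy_scores = {101: 0.2, 205: 0.1, 7: 0.5, 9: 0.9, 999: 5.0}
decoded = constrained_greedy(trie, lambda prefix, tok: toy_scores[tok])
print(decoded)  # [101, 9]
```

The unconstrained favorite, token 999, is never considered, so the output is always a valid tool sequence whatever the model would otherwise have preferred. That guarantee is the scaffold the rest of the article is about.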
The ToolSense authors, Ashutosh Hathidara, Sai Shruthi Sistla, Sebastian Schreiber, and Sahil Bansal, argue that this setup rewards pattern matching more than tool knowledge. A model can learn to walk a token path that ends at the right name without ever internalizing what that tool does. If you then drop the trie constraint, as some agent fine-tuning recipes do to let the model call tools in natural language, the path-finding trick stops working, and so does the recall number. The HTML body of the paper walks through the diagnostic in more detail.
The diagnostic itself is straightforward. You hand ToolSense any tool catalog, and it auto-generates three benchmarks. The first, the Realistic Retrieval Benchmark, rewrites tool queries at three ambiguity tiers, ranging from clean and verbose to terse and missing key words, the kind a busy developer or end user would actually paste. The second is a multiple-choice probe on tool capabilities. The third is a question-answer probe on tool properties. The framework also ships a metric the authors call IS@k, the ratio of free-form recall to trie-constrained recall at cutoff k, that lets you see how much of a model's apparent tool knowledge depends on the decoder scaffold.
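Read literally, that definition of IS@k reduces to a few lines. The sketch below follows only the description above (free-form recall divided by trie-constrained recall at cutoff k). The function names, the data layout, and the zero-denominator handling are assumptions, not the ToolSense implementation.

```python
import math
from typing import List, Sequence


def recall_at_k(ranked: Sequence[Sequence[str]], gold: Sequence[str], k: int) -> float:
    """Fraction of queries whose gold tool appears in the top-k predictions."""
    hits = sum(1 for preds, g in zip(ranked, gold) if g in preds[:k])
    return hits / len(gold)


def is_at_k(
    free_ranked: Sequence[Sequence[str]],
    trie_ranked: Sequence[Sequence[str]],
    gold: Sequence[str],
    k: int,
) -> float:
    """Ratio of free-form recall@k to trie-constrained recall@k."""
    constrained = recall_at_k(trie_ranked, gold, k)
    if constrained == 0:
        return math.nan  # assumption: ratio is undefined without constrained hits
    return recall_at_k(free_ranked, gold, k) / constrained


# Made-up example: four queries, one gold tool each.
gold: List[str] = ["weather", "stocks", "maps", "email"]
trie_ranked = [["weather"], ["stocks"], ["maps"], ["email"]]  # 4 of 4 hits
free_ranked = [["weather"], ["news"], ["maps"], ["calendar"]]  # 2 of 4 hits

ratio = is_at_k(free_ranked, trie_ranked, gold, k=1)
print(ratio)  # 0.5
```

A ratio near 1 would mean the model retrieves about as well without the trie as with it. A ratio well below 1, as in this made-up example, means much of the measured recall depends on the constraint.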
When the team ran the suite across five parametric training configurations on ToolBench, two patterns stood out. The first is the collapse: on the Realistic Retrieval Benchmark, several configurations lost 50 to 64 percentage points of recall relative to their fully-specified ToolBench scores and slid below a standard embedding-model retriever. The second is a dissociation: some of the same models that still surfaced the right tool on a fuzzy query then scored near random on the multiple-choice and QA probes, meaning they could name the tool but not explain it. In a follow-up comparison, the second-stage retrieval fine-tuning used by the standard recipe destroyed, in almost every case, the tool knowledge that the first-stage memorization pass had built.
For a team shipping an agent today, the practical reading is grim but specific. If your evaluation pipeline runs against verbose, fully-specified queries with trie-constrained decoding in the loop, your recall number is partly a property of the scaffold, not the model. If you fine-tune that model further to call tools in natural prose, you are likely to lose tool knowledge that the scaffold was masking. The fix is to add an ambiguity tier and a free-decoding tier to your eval, and to check, on at least a sample, whether the model that picks the right tool can also answer what the tool does, what it returns, and what its inputs are.
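One way to organize such an eval is a grid of ambiguity tier by decoding mode, plus a check on how often a correct retrieval comes with a failed knowledge probe. The sketch below is a hypothetical harness, not part of ToolSense: the `EvalRecord` fields, the tier labels, and the sample data are all invented for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class EvalRecord:
    tier: str           # ambiguity tier label, e.g. "verbose" or "terse" (made up)
    decoding: str       # "trie" for constrained decoding, "free" for unconstrained
    retrieved_ok: bool  # did the model surface the gold tool?
    probe_ok: bool      # did it answer a question about what that tool does?


def recall_grid(records: List[EvalRecord]) -> Dict[Tuple[str, str], float]:
    """Recall per (tier, decoding) cell."""
    totals: Dict[Tuple[str, str], int] = defaultdict(int)
    hits: Dict[Tuple[str, str], int] = defaultdict(int)
    for r in records:
        key = (r.tier, r.decoding)
        totals[key] += 1
        hits[key] += int(r.retrieved_ok)
    return {key: hits[key] / totals[key] for key in totals}


def names_without_knowing(records: List[EvalRecord]) -> float:
    """Among correct retrievals, the share where the knowledge probe failed."""
    correct = [r for r in records if r.retrieved_ok]
    if not correct:
        return 0.0
    return sum(not r.probe_ok for r in correct) / len(correct)


# Invented sample results, two per cell.
records = [
    EvalRecord("verbose", "trie", True, True),
    EvalRecord("verbose", "trie", True, False),
    EvalRecord("terse", "free", True, False),
    EvalRecord("terse", "free", False, False),
]

grid = recall_grid(records)
gap = names_without_knowing(records)
for cell, recall in sorted(grid.items()):
    print(cell, recall)
print("right tool, failed probe:", round(gap, 2))
```

Reporting the full grid rather than a single recall figure shows at a glance whether a score survives terse queries and free decoding. The second number flags models that pick the right tool without being able to describe it.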
ToolSense is open source. The framework, the three benchmarks, and the diagnostic metrics are released at github.com/SAP/toolsense, and because the benchmarks are auto-generated from whatever catalog you supply, the same audit can be run against an internal catalog rather than only a public benchmark. That matters because the paper's collapse numbers are reported on five training configurations of the authors' choosing, on a single catalog, ToolBench. Generalization to other catalogs, other model families, and other parametric schemes is not established by the paper itself, and the 50 to 64 percentage-point range is the authors' own reported finding, not an independently replicated result.
ToolBench in its standard form is a vocabulary test, not a knowledge test, and the standard recall numbers are partly a property of the constrained decoder. ToolSense gives you a way to tell the difference on your own data, which is the only place the answer actually matters.