The Missed-Support Error: A New Reliability Lens for Agentic AI
A preprint inverts decision support theory, recasts humans and tools as a support layer for AI agents, and gives the field a formal way to count the failures that happen in silence.
A preprint inverts decision support theory, recasts humans and tools as a support layer for AI agents, and gives the field a formal way to count the failures that happen in silence.
Most agentic systems fail quietly. An AI agent that goes solo on a question where a human check or a tool call would have caught the mistake leaves no signal in the logs. That silence is a reliability problem the field has not learned to name. A new preprint, "Strategic Decision Support for AI Agents", takes a swing at it by inverting the classical decision-support picture and putting a formal error metric at the center.
For decades, decision-support research asked how to get humans to use model advice. The new framing starts from the opposite direction. In an agentic system, the AI is the principal actor; humans, tools, and other models are the support layer. The question is no longer whether a person will accept a recommendation. It is whether the agent will, on any given call, pull in the help it actually needs.
The paper names the failure mode: the missed-support error. That is the probability that an agent acts alone on an instance where invoking support would have materially improved the output. It is a counterfactual quantity, computed against what would have happened if help had been called, and it is exactly the kind of error that escapes standard accuracy metrics. A system can look competent on benchmarks while racking up missed-support errors that no one is counting.
Once the error is named, the paper recasts the design problem. The agent's policy should minimize support usage subject to a constraint on the missed-support error. That formulation is the agentic version of a classical cost-value tradeoff: every call to a human reviewer or a tool costs time, money, or context-window budget, and the design question is how to spend that budget where it pays off. At the population level, the optimal policy turns out to be a threshold rule: invoke support only when the expected value of doing so clears a fixed bar, and otherwise act alone. That result is a formal counterpart to the intuition behind human-in-the-loop escalation, except the bar is calibrated to an explicit error budget rather than to a vibe about confidence.
The practical contribution is an online algorithm that runs without distributional assumptions. The agent uses randomized exploration to estimate the value of support on the instances it is currently skipping, then folds that estimate into a calibration-on-the-fly step that tightens or loosens the threshold so missed-support error stays inside budget. The point of randomization is to escape the trap of never calling support on the very cases where it would have helped, which is the failure mode the framework exists to control.
The instantiation spans information gathering, tool use, and human-AI collaboration under one vocabulary, which is the part most likely to matter to builders. Instead of bolting separate "should I search?" or "should I escalate?" heuristics onto a pipeline, a developer can wire each support channel into the same threshold calculation and let the algorithm balance them.
A few caveats are worth carrying into any reading. The paper is a preprint, not peer-reviewed work, and the framework depends on the same counterfactual comparisons that any causal-style reliability argument would, so replication and empirical calibration in deployed systems remain open questions. The author and affiliation list were not part of the abstract fetched here, so the work should be cited as a framework proposal awaiting independent corroboration, not as a settled finding. The distribution-free design is a strength for deployment, but it also means the algorithm pays a learning cost up front, and the size of that cost in practice has not yet been published.
What to watch: whether agentic infrastructure vendors start exposing a missed-support rate as a first-class metric the way latency or cost-per-task is exposed today, and whether the threshold-rule picture survives contact with messier support channels, like human reviewers, where the value of a call is harder to estimate on the fly.