Strict Output Rules Can Make Open-Weight AI Agents Stop Calling Tools


A production-style AI agent can satisfy a strict JSON output check with near-perfect fidelity and still never call a tool. That paradox is the headline finding of a new preprint, arXiv:2606.25605, submitted on 2026-06-24 by independent researcher Aimin Zhang. The paper documents a reproducible failure mode that bites precisely where reliability teams would expect safety nets: at the joint of two constraints most agent pipelines treat as separable.

The setup is mundane, which is what makes it dangerous. An agent is told to (a) call a tool and (b) produce output that obeys a JSON Schema, a strict set of rules describing the shape of the response. Run the two constraints together, and across multiple openly downloadable model families the agent complies with the schema and quietly drops the tool call. Run either constraint on its own, and the same model calls the tool as expected, according to the paper's controlled experiments.

The technical explanation is implementation-level rather than a claim about model internals. Modern inference stacks compile JSON Schema constraints into grammar-based token masks, which restrict the model's output to tokens that satisfy the schema. If the tokens required to mark a tool call fall outside that mask, the tool-call tokens are simply unreachable, and the decoder still produces a perfectly formatted response. The paper calls this behavior Tool Suppression and names the broader cost it imposes the Constraint Tax. The author's evidence and figures are available in the preprint PDF and the arXiv HTML version.
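The masking argument can be illustrated with a toy decoder. This is a minimal sketch, not the paper's code or any real inference stack: the vocabulary, the `<tool_call>` marker, the one-token-per-step grammar, and the logits are all hypothetical, chosen only to show how a mask can make a strongly preferred token unreachable.

```python
import math

# Hypothetical toy vocabulary; real tokenizers and tool-call markers differ by model.
VOCAB = ["<tool_call>", "{", "}", '"answer"', ":", '"ok"', "<eos>"]

# Toy grammar for the schema {"answer": "ok"}: exactly one token is legal per step.
SCHEMA_SEQUENCE = ["{", '"answer"', ":", '"ok"', "}", "<eos>"]


def allowed_tokens(step):
    """Return the set of tokens the toy grammar permits at this step."""
    return {SCHEMA_SEQUENCE[step]}


def masked_greedy_decode(logits_per_step):
    """Greedy decoding where every token outside the grammar mask scores -inf."""
    output = []
    for step, logits in enumerate(logits_per_step):
        allowed = allowed_tokens(step)
        masked = [
            score if token in allowed else -math.inf
            for token, score in zip(VOCAB, logits)
        ]
        best = max(range(len(VOCAB)), key=lambda i: masked[i])
        output.append(VOCAB[best])
    return output


# Assumed logits: the model "prefers" <tool_call> (index 0) at every step.
logits = [[10.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0] for _ in SCHEMA_SEQUENCE]
decoded = masked_greedy_decode(logits)
```

Even though `<tool_call>` has the highest raw score at every step, it never appears in `decoded`, because the mask removes it before selection. The output is schema-valid and the tool call is absent, which is the shape of the failure described above.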

The author does not stop at diagnosis. The paper proposes a behavioral hypothesis called Constraint Priority Inversion, or CPI: under multiple simultaneous constraints, schema satisfaction appears to dominate action selection. The author is explicit that CPI is consistent with the evidence rather than a verified internal mechanism, and the framing matters. Reading CPI as "the model decided not to call the tool" overstates what the experiments can show. The experiments show the tool-call path was unreachable in the mask, not that some learned preference suppressed it.

The constructive payoff is an inference-time fix. The paper introduces Transparent Two-Pass Execution, a pattern that decouples tool execution from schema-constrained response generation. The agent first runs the tool step under normal generation, then produces the schema-constrained reply against the tool's result. According to the paper, the pattern restores tool invocation while preserving structured-output guarantees and requires no retraining.
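The two-pass idea can be sketched in a few lines. This is an illustration under stated assumptions, not the author's implementation: `unconstrained_model` and `constrained_model` are hypothetical stubs standing in for real model calls, `get_weather` is a made-up tool, and the schema check is reduced to a required-keys test.

```python
import json


def get_weather(city):
    """Hypothetical tool; stands in for any real tool."""
    return {"city": city, "temp_c": 21}


TOOLS = {"get_weather": get_weather}


def unconstrained_model(prompt):
    """Stub for pass 1: free generation, so a tool call can be emitted."""
    return {"tool": "get_weather", "args": {"city": "Oslo"}}


def constrained_model(prompt, tool_result):
    """Stub for pass 2: schema-constrained generation over the tool result."""
    reply = {"city": tool_result["city"], "summary": f"{tool_result['temp_c']} C"}
    return json.dumps(reply)


def two_pass(prompt, required_keys):
    # Pass 1: no schema constraint is active, so the tool call stays reachable.
    call = unconstrained_model(prompt)
    result = TOOLS[call["tool"]](**call["args"])

    # Pass 2: the schema constraint applies only to the final reply.
    reply = json.loads(constrained_model(prompt, result))
    missing = [key for key in required_keys if key not in reply]
    if missing:
        raise ValueError(f"reply is missing required keys: {missing}")
    return {"tool_called": call["tool"], "reply": reply}


outcome = two_pass("What is the weather in Oslo?", ["city", "summary"])
```

The design point is the ordering: the structured-output constraint is never active while the agent is deciding whether to act, so the two requirements cannot collide in a single decoding step.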

The stakes here are reliability rather than capability. Most agent benchmarks score tool use and structured output as separate axes, which means an agent that passes both can still be inert in production. The paper's broader claim, supported by its controlled setup, is that evaluating the two jointly is necessary for any honest reliability pipeline. That is a constructive critique rather than a doom framing: it gives teams a named, testable failure mode and a deployable pattern.
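A joint check is simple to express. The sketch below is hypothetical throughout: the run records, field names, and the required-keys stand-in for schema validation are assumptions, and the sample data is invented for illustration, not taken from the paper. It shows how an agent can score perfectly on each axis when they are measured separately and still score zero when both are required at once.

```python
def called_tool(run):
    """True if the run recorded at least one tool call."""
    return len(run["tool_calls"]) > 0


def matches_schema(run, required_keys=("answer",)):
    """Simplified schema check: output is a dict containing the required keys."""
    output = run["output"]
    return isinstance(output, dict) and all(key in output for key in required_keys)


def pass_rate(runs, check):
    """Fraction of runs for which the check holds."""
    return sum(1 for run in runs if check(run)) / len(runs)


# Invented sample runs, one list per evaluation condition.
tool_only_runs = [
    {"tool_calls": ["search"], "output": "free text"},
    {"tool_calls": ["search"], "output": "more free text"},
]
schema_only_runs = [
    {"tool_calls": [], "output": {"answer": "a"}},
    {"tool_calls": [], "output": {"answer": "b"}},
]
joint_runs = [
    {"tool_calls": [], "output": {"answer": "a"}},
    {"tool_calls": [], "output": {"answer": "b"}},
]

report = {
    "tool_axis": pass_rate(tool_only_runs, called_tool),
    "schema_axis": pass_rate(schema_only_runs, matches_schema),
    "joint": pass_rate(joint_runs, lambda run: called_tool(run) and matches_schema(run)),
}
```

With this invented data, both single-axis rates are 1.0 and the joint rate is 0.0, so a scorecard that reports only the first two numbers would hide the failure entirely.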

A few caveats bound the finding. The paper is a single-author v1 preprint with no peer review, and the abstract does not enumerate which open-weight model families were tested. The CPI hypothesis is explicitly a behavioral framing consistent with the evidence rather than a confirmed mechanism. The author announces code, data, and documentation at a GitHub repository that was not independently verified during research, so readers should treat the repository as promised rather than confirmed. The mitigation is also inference-time: it changes how an agent is wired, not how it is trained, so adoption is a deploy concern rather than a research breakthrough.

What to watch next: independent reproduction of Tool Suppression across the model families the abstract leaves unnamed, peer review of the CPI framing, and whether the broader agent stack ecosystem adopts joint-constraint evaluation as a default reliability check. The Constraint Tax is small in any single run, and that is exactly what makes it expensive in production.

