Healthcare AI Benchmarks Need Assumption Audits

A clinician would never write this prompt. A patient would never type it. Yet on a healthcare AI leaderboard this month, a model answered it well, and that score moved into a hospital procurement deck.

The question, drawn from the kind of expert-authored single-turn item that fills most clinical LLM benchmarks, asked for a structured differential diagnosis from a polished prompt with no follow-up and no clinician in the loop. Real clinical conversations do not look like that. Patients describe symptoms in fragments, ask the same question three different ways, and use language their doctors do not. The clinician, not the model, makes the call. None of those facts appear in the score.

A new position paper from the Carnegie Mellon Machine Learning Department argues that this is not a fluke. It is the predictable result of stacking five hidden assumptions into every healthcare AI benchmark, then publishing the aggregate score as if those assumptions did not exist. The paper labels the resulting mismatch the "Deployment Gap," the distance between what a benchmark measures and what a deployed system will face. In the CMU ML Blog post accompanying the paper, the authors treat the gap as a definition problem rather than a bug. Whoever controls the working definition of "accurate" controls the license to operate. The paper's contribution is to hand buyers, builders, and regulators a shared five-question vocabulary for pulling that assumption layer into the open.

The first assumption is query distribution. Most healthcare LLM benchmarks are built from prompts written by clinicians or domain experts, prompts that read like board exam questions rather than like what a frightened patient types at 2 a.m. If the test prompts do not resemble the production traffic, a high score does not transfer to the bedside.

The second is interaction type. Benchmarks are mostly single-turn: a prompt goes in, an answer comes out, the answer is graded. Real clinical use is multi-turn. A patient asks about a symptom, gets an answer, asks for clarification, supplies more context, and reframes the question. A model that scores well on isolated prompts can fail on the dialogue that follows them.

The third is decision mediation. In many evaluations the model's output is treated as a proxy for the clinician's decision, as if the model had decided anything. In practice a clinician reads the output, weighs it against the chart, and sometimes overrides it. Treating the model's draft answer as the decision inflates apparent performance and hides the cases where the clinician had to rescue the patient from the model.

The fourth is proxy outcome. Benchmarks grade against measured proxies such as annotator agreement, rubric scores, or accuracy on test items. These proxies are assumed to correlate with true health outcomes. The paper flags that assumption as unverified in most cases. A model can score high on a proxy and still be net-harmful at the bedside if the proxy does not track a real outcome.

The fifth layer sits underneath the others: the assumption that "clinical task" and "clinical outcome" are the same thing. They are not. The right answer to a board-style prompt is not the same as the right action for a specific patient in a specific room, and a benchmark that does not distinguish between the two cannot tell buyers which model to trust.

These assumptions travel together. A single-turn, expert-authored, clinician-decision-stand-in, proxy-graded test rewards a particular kind of model, a model tuned to win that kind of test. Procurement officers who read only the leaderboard number have no way to see the assumption stack the number was built on, and the vendor has no incentive to publish it.

The paper's constructive move is to convert the five assumptions into five questions a buyer can put in writing. Who wrote the prompts, and do they resemble the queries this system will see in production? Is the test single-turn or multi-turn? Does the score reflect the model's output, or the clinician's decision after reading that output? What proxy was used, and has it been shown to track true outcomes? Does the benchmark distinguish task correctness from clinical outcome? Any one of these questions, asked in a procurement meeting, reframes the conversation from "what was the score" to "what was the score conditioned on."
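For a committee that wants to put the five questions in writing, one option is to record each vendor's answers as a structured disclosure and flag what is missing. The Python sketch below is illustrative only: the field names, the `AssumptionAudit` class, and the "ExampleBench" entry are hypothetical and are not taken from the paper, which supplies the questions rather than any schema.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical field names mapped to the five questions described in the text.
QUESTIONS = {
    "prompt_authorship": "Who wrote the prompts, and do they resemble production queries?",
    "interaction_type": "Is the test single-turn or multi-turn?",
    "decision_mediation": "Does the score reflect the model's output or the clinician's decision?",
    "proxy_outcome": "What proxy was used, and has it been shown to track true outcomes?",
    "task_vs_outcome": "Does the benchmark distinguish task correctness from clinical outcome?",
}


@dataclass
class AssumptionAudit:
    """One vendor benchmark plus its disclosed answers (None means undisclosed)."""
    benchmark: str
    prompt_authorship: Optional[str] = None
    interaction_type: Optional[str] = None
    decision_mediation: Optional[str] = None
    proxy_outcome: Optional[str] = None
    task_vs_outcome: Optional[str] = None

    def undisclosed(self) -> List[str]:
        """Return the questions the vendor has not answered."""
        return [text for name, text in QUESTIONS.items() if not getattr(self, name)]


# Made-up example entry: only two of the five questions are answered.
audit = AssumptionAudit(
    benchmark="ExampleBench",
    interaction_type="single-turn",
    proxy_outcome="rubric score; no outcome validation reported",
)

for question in audit.undisclosed():
    print("Undisclosed:", question)
```

A record like this does not judge whether the answers are good. It only makes visible which assumptions a headline score leaves unstated, which is the disclosure step the questions are meant to force.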

That reframe is why the position paper is more useful as a procurement instrument than as a critique. A hospital AI committee does not need to reject vendor benchmarks wholesale. It needs to require assumption disclosure as a shipping requirement, the way clinical trials require pre-registration of endpoints before results are accepted. A benchmark that cannot answer the five questions does not deserve to clear procurement on the strength of its headline number.

The next test is whether the vocabulary travels. If five-question assumption audits become standard in how hospitals buy clinical AI, the leaderboard numbers that move procurement decks will look different within a year, and so will the models tuned to win them.
