Generative AI Improved How Clinicians Worked in Kenya. It Didn't Improve the Patients.
A 16-clinic, 9,600-patient randomized trial of a generative AI clinical decision support tool built on a model that scored 90.4% on U.S.
A generative-AI clinical decision-support tool built on a model that scored 90.4% on U.S. medical licensing exam-style questions produced better clinician notes and more accurate diagnoses, yet patients whose clinicians used it fared no better than patients whose clinicians did not. The Kenya pragmatic trial is the clearest answer yet, drawn from real patient data, to a question the $5.6 billion AI-healthcare investment wave assumed was settled: does benchmark accuracy translate to clinical effectiveness?
The result, published this week in Nature Medicine, lands in a field that has spent two years reading high medical-board scores as clinical readiness. It is the first large, pragmatic, cluster-randomized trial of a generative-AI clinical decision-support system (CDSS) in real primary care, and its primary endpoint came in flat.
Across 16 clinics in Kenya and more than 9,600 patients, the tool, known as AI Consult, produced statistically significant gains in the quality of clinical documentation and in clinicians' decision-making process. The primary patient outcome was null: 2.2% of patients in the AI-assisted arm experienced worsening symptoms or required additional treatment within 14 days, versus 2.0% in the control arm. The difference was nowhere near statistical significance.
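For readers who want a feel for why a 0.2-point gap is so far from significance, here is a rough back-of-the-envelope sketch in Python. It is not the trial's analysis. The per-arm sizes are an assumption (an even 4,800/4,800 split of the roughly 9,600 patients, which the article does not report), and the test is a naive two-proportion z-test that ignores the clinic-level clustering a cluster-randomized trial must account for. Clustering generally widens the uncertainty, so the real comparison would be, if anything, even further from significance than this sketch suggests.

```python
import math

# ASSUMPTION: arm sizes are hypothetical. The article reports only
# "more than 9,600 patients" in total, not the per-arm split.
N_AI_ARM = 4800
N_CONTROL_ARM = 4800

# Event rates reported in the article (worsening symptoms or
# additional treatment within 14 days).
RATE_AI_ARM = 0.022
RATE_CONTROL_ARM = 0.020


def two_proportion_z(p1, n1, p2, n2):
    """Naive pooled two-proportion z-test.

    Treats patients as independent, which a cluster-randomized
    design does not justify; shown only for intuition.
    Returns (risk difference, z statistic, two-sided p-value).
    """
    events_1 = p1 * n1
    events_2 = p2 * n2
    pooled = (events_1 + events_2) / (n1 + n2)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / std_err
    # Two-sided p-value under the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return p1 - p2, z, p_value


diff, z, p_value = two_proportion_z(
    RATE_AI_ARM, N_AI_ARM, RATE_CONTROL_ARM, N_CONTROL_ARM
)
print(f"risk difference: {diff:.4f}")
print(f"z statistic:     {z:.2f}")
print(f"two-sided p:     {p_value:.2f}")
```

Under these assumed arm sizes the z statistic stays well below the conventional 1.96 threshold, which is consistent with the article's description of the gap as nowhere near significant.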
That asymmetry is the finding. The same model class, according to Clinical Trial Vanguard's analysis of the underlying system, scored 90.4% on U.S. medical licensing exam-style items, a number routinely cited as proof of clinical competence. The exam-style questions, by design, reward recall and pattern matching against an established textbook answer. The clinic rewards something else: the right diagnosis landing on the right patient at the right moment, through a referral pathway the system can actually deliver.
The Kenya setting helps explain the gap. Sub-Saharan Africa has roughly 0.3 physicians per 1,000 people, against an OECD average near 3.9 per 1,000. Most Kenyan primary care is delivered by clinical officers, mid-level practitioners trained on a three-year diploma rather than a medical degree. A cross-sectional survey of 112 health facilities in Homa Bay County found that 91% used the Kenya Electronic Medical Record system primarily for HIV care, a hint that the digital infrastructure an AI CDSS would plug into was not designed for general decision support. In low- and middle-income countries, roughly 60% of deaths from amenable conditions occur after the patient has already reached the health system, a reminder that the failure mode in these settings is care quality, not access.
Pharmacology already has a name for this kind of separation. Efficacy is what a treatment does under ideal, controlled conditions. Effectiveness is what it does in the real world, with real patients, real clinicians, and real infrastructure. The Kenya trial is, in effect, the first large-scale test of generative-AI CDSS effectiveness, and effectiveness, here, looked a lot like the control arm.
That distinction matters because of the money already chasing the assumption. AI-backed healthcare and biotech captured roughly $5.6 billion in investment in 2024, about three times the prior year, with benchmark performance widely read as a leading indicator of clinical deployment. The Kenya data do not say the technology is useless. The trial's secondary findings show real signal: clinicians made more accurate decisions, documentation improved, and the tool worked as a partner in the consultation room. What the data do say is that better clinician decisions, in this setting, did not translate into better healing at the two-week mark. Something between the decision and the patient, including referral pathways, drug availability, follow-up capacity, and patient comprehension, absorbed the gain.
That is the watch item now. The next round of trials will need to measure effectiveness, not efficacy: outcomes at the patient level, in the actual clinical workflow, against the actual infrastructure of the system being targeted. The University of Birmingham-led team running the Kenya trial has effectively redrawn the finish line. Sponsors, electronic-health-record vendors, and the investors writing checks against board-style scores should pay attention: a 90.4% on U.S. medical licensing exam-style questions is a starting point, not a destination, and the Kenya numbers show how much of the journey is still ahead.