Medicine built the perfect benchmark for AI to beat
A patient arrived at a Boston emergency department with a blood clot in the lungs. The treating physicians suspected the blood-thinning medication was failing. An AI system reading the same electronic health record had a different theory: the patient's history of lupus, an autoimmune condition, might be causing inflammation in the lungs. The AI was correct.
That case appears in a study published Thursday in the journal Science by researchers at Harvard Medical School and Beth Israel Deaconess Medical Center. It is one of 76 real emergency room cases in which the team ran OpenAI's o1 reasoning model against two experienced physicians, giving both the same electronic health records, vital signs, and nurses' notes. The AI identified the correct diagnosis in 67 percent of cases. The physicians, working from identical records, were right 50 to 55 percent of the time. When o1 was given more detailed records, its accuracy reached 82 percent, compared with 70 to 79 percent for the human doctors.
The gap was wider on treatment planning. When 46 physicians and the AI were each asked to develop care plans for five complex case studies, the computer scored 89 percent. The doctors, using conventional resources like search engines, scored 34 percent.
The results are getting attention inside medicine because the study used real patient records rather than board exam questions. "This is the big conclusion for me — it works with the messy real-world data of the emergency department," said Dr. Adam Rodman, a clinical researcher at Beth Israel and one of the study authors. The study notes that the AI was working from text alone and was not tested on imaging, tone of voice, or visual examination.
The findings land as OpenAI has released its own clinical product. ChatGPT for Clinicians, launched April 22, is a free AI co-pilot for verified U.S. physicians, backed by an open benchmark called HealthBench Professional that the company published on arXiv alongside it. The benchmark was developed with 190 physicians across 50 countries and 26 specialties, with evaluation rubrics reviewed by three or more physicians per task. On HealthBench Professional, GPT-5.4 in ChatGPT for Clinicians scored 59.0, compared with 43.7 for physician-written control responses (p = 3.7 × 10⁻¹⁰). The 525-example dataset is publicly downloadable.
These are separate developments with a shared implication. The Science paper provides third-party validation that a frontier AI model — o1, not the product specifically being sold — can outperform physicians on structured clinical reasoning tasks from real records. HealthBench Professional provides OpenAI with an open, replicable benchmark that any competitor can run against. The transparency matters in a field where benchmark results often come without independent verification.
The commercial landscape is already responding. OpenEvidence, a clinical AI company that competes in the physician workflow space, reached $100 million in annualized revenue by January 2026 and has raised approximately $700 million from investors including Google Ventures, Sequoia, and Nvidia, according to published reports. Doximity, which sells DoxGPT to physicians, published its own evaluation finding that doctors preferred its tool 61 percent of the time, versus 26 percent for OpenEvidence. Both are now competing with a free product backed by a company with far broader distribution.
The Harvard researchers were explicit about what their results do not show. The AI was working from text in records. It was not reading imaging, assessing a patient's appearance, or navigating the social and emotional dimensions of clinical care. "I don't think our findings mean that AI replaces doctors," said Arjun Manrai, who heads an AI lab at Harvard Medical School and co-authored the study. The researchers instead described a "triadic care model": the doctor, the patient, and an AI system. What that looks like at scale is the open question.
One version of the question is regulatory. The accountability structure around a hospital-deployed AI system includes onboarding, protocols, institutional liability, and vendor contracts. Individual clinicians using a free product have whatever documentation the tool provides. A physician writing in Work and Health noted that a clinician who uses a free AI tool for a clinical reasoning question and acts on the answer may have a legible chain of responsibility; if that same tool quietly shapes thousands of clinical decisions across a profession, the chain gets harder to trace.
The adoption data is not theoretical. OpenAI cites a 2026 American Medical Association survey in which 72 percent of physicians report using AI in clinical practice, up from 48 percent the year before. That behavior predates ChatGPT for Clinicians. The product formalizes what is already happening.
The Science paper and HealthBench Professional answer two separate questions. The paper answers whether AI can outperform physicians on difficult diagnostic tasks given the same information: on the evidence of this study, yes, substantially. HealthBench Professional answers what happens when that performance becomes measurable, replicable, and publicly available: any competitor can run the same test, and the results become a baseline rather than a claim. Both are arguments that AI has crossed a threshold in clinical reasoning. The threshold is not about replacing medicine. It is about what the profession now counts as normal.