Three years ago, 38 percent of physicians used AI in their clinical work. Today, that number is 81 percent. The AMA's own survey data makes clear that AI has moved from experiment to routine in American medicine. What nobody has answered yet is what that shift means for the skills physicians spent years acquiring, or whether those skills still define what makes a good doctor.
OpenAI published an answer this week, and it is uncomfortable for medical educators. On a benchmark called HealthBench Professional, its GPT-5.4 model scored 59.0. Human physicians, given unlimited time and full web access, scored 43.7. The machine won. Clinical reasoning, the diagnostic skill that medical schools spend years building, is exactly what the test measures. But OpenAI built the benchmark, ran the evaluation, and published the results itself, in its own blog post; no independent lab has confirmed the scores.
The adoption data gives the moment its weight. More than four in five U.S. physicians now use AI professionally, up from 38 percent in 2023, according to the AMA. Seventy percent describe it as a tool for tasks contributing to burnout, the administrative load that has nothing to do with clinical judgment. Seventy-six percent believe AI improves their ability to care for patients. The adoption numbers are moving faster than the policy frameworks designed to govern them.
OpenAI's move this week is a distribution play as much as a product one. Any verified U.S. physician, nurse practitioner, physician assistant, or pharmacist can now access ChatGPT for Clinicians at no charge, OpenAI said. Hospital procurement, IT approval, and enterprise contracting are slow. Giving the product to individual clinicians and letting it spread through clinical habit is a different kind of speed. ChatGPT for Clinicians includes documentation tools, literature review, care consultation templates, and prior authorization support. CME credits are available through research done in the product. The free tier extends OpenAI's footprint downward from enterprise accounts to individual practitioners.
The benchmark case is real but self-evaluated. The 11-point gap between the clinical workspace version of GPT-5.4 and the base version is not fully explained in OpenAI's documentation. Competitor scores come from the same evaluation run: Claude Opus 4.7 at 47.0, Gemini 3.1 Pro at 43.8, and Grok 4.2 at 36.1. The independence problem is structural: OpenAI has an interest in the outcome.
Before launch, OpenAI's physician advisors reviewed more than 700,000 model responses and rated 99.6 percent as safe and accurate. Those advisors work with OpenAI. The BMJ published a peer-reviewed study finding that roughly half of AI chatbot responses to medical questions were highly problematic, with fabricated citations presented with false confidence, as Gary Marcus reported on his Substack. The two findings come from different methodologies, different contexts, and different incentive structures. The gap between them is not resolved by either study.
Nobody is currently tracking whether physicians who use AI extensively show changes in downstream clinical competence over time. There is no equivalent in medicine to the GPS-and-spatial-memory literature: no systematic data on whether AI-assisted diagnosis correlates with skill maintenance or skill decay as physicians age. Medical schools are beginning to grapple with how to train clinicians who will use AI as a matter of course. The competency-tracking infrastructure does not yet exist.
OpenAI is moving faster than that conversation. The free tier is available now. The benchmark number is published. The AMA adoption data tells you the market is ready. What the story has not yet settled is whether a machine that outperforms physicians on a standardized reasoning benchmark has changed anything about how medicine actually works, or whether it has simply revealed how much of what physicians do is administrative and how much of the rest is still theirs alone.