Seventeen radiologists from 12 hospitals in six countries sat down to review 264 chest X-rays. Half were real. Half were fabricated by AI. When asked to spot the fakes without being told the set contained any, the doctors identified deepfake X-rays correctly just 41 percent of the time — worse than a coin flip. That is the finding at the center of a peer-reviewed study published March 24 in Radiology, and it should concern anyone who interacts with hospital imaging infrastructure.
The study, "The Rise of Deepfake Medical Imaging: Radiologists Diagnostic Accuracy in Detecting ChatGPT-generated Radiographs" (DOI 10.1148/radiol.252094), was conducted by researchers led by Stanford radiologist Marcus Tordjman and represents the most systematic evaluation to date of whether AI-generated medical images can fool trained clinicians. The short answer: they can, and they did.
The experimental design was straightforward. Radiologists from the United States, France, Germany, Turkey, the United Kingdom, and the United Arab Emirates reviewed the image set without prior knowledge that synthetic images were included. Their spontaneous accuracy of 41 percent — worse than chance — improved to 75 percent once they were told the dataset contained AI-generated images and asked to look for artifacts. But 75 percent is not a reassuring number when the stakes include a fabricated fracture submitted as evidence in a personal injury lawsuit, or a synthetic CT scan injected into a hospital PACS to manipulate a diagnosis.
The deeper problem is that AI detectors fared little better. When researchers tested a panel of frontier language models on the same image detection task, none performed reliably. GPT-4o achieved 85.1 percent accuracy on images it had generated itself, but only 75.5 percent on images created by RoentGen, the Stanford-developed open-source diffusion model that the study used for the bulk of its synthetic image generation. GPT-5 scored 82.5 percent and 89.1 percent on the same two categories. Gemini 2.5 Pro and Llama 4 Maverick both performed near chance, between 51.8 percent and 60.9 percent across image types. The model that made the fakes could not reliably catch them — a limitation rooted in the fact that large language models have no memory of what they have generated and no built-in mechanism to watermark their own output.
"We are potentially only seeing the tip of the iceberg," Tordjman told Reuters. The study used 2D chest X-rays because they represent the simplest synthesis target. The logical next step — AI generation of 3D CT scans and MRI images — is technically feasible and would raise the stakes considerably. A synthetic MRI could, in theory, be indistinguishable from a real one to a radiologist who has been reading them for decades.
What gives the current generation of synthetic X-rays away, when you know to look, is a catalog of tells that feel almost too obvious once named. Bones are overly smooth. Spines are unnaturally straight. Lungs are too symmetrical. Blood vessel patterns are excessively uniform. Fractures appear unusually clean and are often limited to one side of the bone. Musculoskeletal radiologists — those who read bone and joint images as a specialty — detected the deepfakes at 83 percent accuracy, compared with 70.3 percent for general radiologists, a difference that reached statistical significance. But here is the finding that should complicate any plan to simply train radiologists to spot AI-generated images: years of professional experience showed no correlation with detection accuracy. The radiologists who had been reading films for 40 years performed no better, on average, than residents in their first year. Experience had taught them what real pathology looks like. It had not taught them what AI-generated fake pathology looks like, because fake pathology did not exist until recently.
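To make the "lungs are too symmetrical" tell concrete, here is a minimal sketch of what an automated screen for suspicious regularity could look like. This is not the study's method; the filename, the mirror-comparison score, and the idea of flagging implausibly low asymmetry are assumptions invented for illustration, and real anatomy is irregular enough that a heuristic this crude would produce plenty of false alarms.

```python
# Hypothetical illustration only, not the detector used in the Radiology study.
# Crude idea: compare a chest X-ray with its left-right mirror image. Synthetic
# images that are "too symmetrical" would score unusually low on this measure.
import numpy as np
from PIL import Image

def asymmetry_score(path: str) -> float:
    """Mean absolute difference between the image and its horizontal mirror.

    Lower values mean the two halves are more alike. An implausibly low score
    might hint at synthetic over-regularity, but on its own this measure is far
    too blunt to separate real from fake.
    """
    img = np.asarray(Image.open(path).convert("L"), dtype=np.float32) / 255.0
    mirrored = img[:, ::-1]                      # flip left-right
    return float(np.mean(np.abs(img - mirrored)))

if __name__ == "__main__":
    score = asymmetry_score("chest_xray.png")    # placeholder filename
    print(f"left-right asymmetry: {score:.4f}")
```

Any serious detection pipeline would need far richer features than a mirror comparison, but the sketch shows why the tells in the study's catalog are, in principle, measurable rather than purely a matter of trained intuition.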
The open-source dimension matters here. RoentGen was developed by a team at Stanford Medicine led by Curtis Langlotz, a professor of radiology and biomedical informatics research, and Akshay Chaudhari, an assistant professor of radiology. The model was fine-tuned on chest X-rays and released publicly. That means the capability to generate realistic synthetic medical images is not locked inside a corporate research lab — it is freely available, right now, to anyone with the technical skill to run it. At the time of the study, the images carried no embedded metadata to distinguish them from real acquisitions. No digital watermark. No cryptographic chain of custody. "AI has lowered the cost of fabricating medical truth to nearly nothing," Bhayana and Krishna wrote in an accompanying editorial in MedPage Today.
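It is worth sketching what even a minimal cryptographic chain of custody could look like, since its absence is the heart of that complaint. The example below is hypothetical: nothing like it was attached to the study's images, and the key handling, function names, and file format are assumptions. The idea is simply that the acquiring device signs the raw pixel bytes with a managed secret key, and any downstream reader recomputes the tag to confirm the pixels have not been altered or swapped since acquisition.

```python
# Hypothetical sketch of image provenance via HMAC signing; not a feature of
# RoentGen, the study, or any specific PACS product. Key provisioning and
# storage of the tag alongside the image are assumed to exist elsewhere.
import hmac
import hashlib

SECRET_KEY = b"replace-with-a-managed-per-device-key"  # assumption: per-device key management

def sign_image(pixel_bytes: bytes) -> str:
    """Compute an HMAC-SHA256 tag over the raw pixel data at acquisition time."""
    return hmac.new(SECRET_KEY, pixel_bytes, hashlib.sha256).hexdigest()

def verify_image(pixel_bytes: bytes, tag: str) -> bool:
    """Recompute the tag on read-out; a mismatch means the pixels changed after signing."""
    return hmac.compare_digest(sign_image(pixel_bytes), tag)

if __name__ == "__main__":
    pixels = open("chest_xray.png", "rb").read()   # placeholder file
    tag = sign_image(pixels)
    print("verified:", verify_image(pixels, tag))  # True until the bytes are altered
```

A symmetric key shared between modality and archive is the simplest version; public-key signatures would let anyone verify provenance without being able to forge it, which matters once images leave the hospital for courtrooms or second opinions.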
The cybersecurity implications are specific and serious. If a bad actor gained access to a hospital network and injected synthetic images into the PACS — the Picture Archiving and Communication System that stores and displays medical imaging — they could, in principle, create or obscure diagnoses for individual patients, manufacture evidence for litigation, or create clinical chaos across an entire institution. A fabricated fracture that looks real to a radiologist reading at 3 a.m. after a 14-hour shift is not a theoretical risk. It is an engineering problem that has been solved.
Tordjman tested himself against his own fakes a few days before the paper was published. His accuracy was 85 to 90 percent. "Even myself, I cannot tell which ones are real or fake for sure," he told the Boston Globe.
The primary study appears in Radiology, a peer-reviewed journal, and carries more methodological weight than a preprint or a blog post. It is worth noting, however, that the full paper is behind a paywall; this report relies on secondary sources, including Reuters, MedPage Today, ScienceDaily, Stanford Medicine's own coverage, and the Boston Globe, for specific findings and quotes. Readers who want to evaluate the raw data directly should consult the original paper at DOI 10.1148/radiol.252094.
What the study does not answer is what comes next. The authors have documented that the vulnerability exists. They have not documented whether it has been exploited, how hospital imaging infrastructure might be hardened against synthetic image injection, or what a robust detection pipeline would need to include. Those are engineering questions that do not yet have engineering answers. The 41 percent figure is a snapshot of where radiologists stand today against a specific generation of synthesis tools. The next generation will be better.
The dual failure — expert humans fooled, frontier AI fooled, the tool that made the fakes unable to catch them reliably — is the finding that should stay with anyone building, funding, or regulating medical AI systems. The problem is not that radiologists are bad at their jobs. The problem is that the task has changed in ways their training did not anticipate, and the tool that changed it is open, distributed, and improving.