Anthropic flagged the paper internally before it went public. Fortune reported that the company quietly circulated the research to government officials, and a person familiar with the matter confirmed that account to TechCrunch. The company declined to comment.
The paper's core finding is that frontier multimodal AI systems routinely produce confident, detailed descriptions of visual inputs they were never given — a failure mode the Stanford team calls "mirage reasoning." On a benchmark called Phantom-0, every tested model generated fabricated visual details more than 60 percent of the time. The medical cases are where the mirage turns dangerous. Across five imaging categories — chest X-ray, brain MRI, pathology, cardiology, and dermatology — phantom diagnoses skewed toward acute, high-acuity conditions: ST-elevation myocardial infarction, melanoma, and carcinoma appeared disproportionately often. These are exactly the cases where a confident wrong answer is worse than no answer. A missed STEMI kills; a fabricated melanoma triggers an invasive biopsy.
The Qwen-2.5 result is the most striking data point. Fine-tuned on the public training set of ReXVQA — the largest chest X-ray visual question-answering benchmark — the text-only, 3-billion-parameter model ranked first on the held-out test set, beating all frontier multimodal competitors and surpassing human radiologists by more than 10 percent on average. It generated plausible explanations for its answers, indistinguishable from human-written ground truth, with no visual input whatsoever.
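To make that setup concrete, here is a minimal sketch of what "answering without the image" looks like in practice: a text-only language model scores each multiple-choice option given only the question text. The checkpoint name, prompt format, and multiple-choice assumption are illustrative, not taken from the paper.

```python
# Hypothetical sketch: a text-only causal LM answers VQA questions with no image.
# Checkpoint, prompt template, and data format are assumptions for illustration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-3B-Instruct"  # stand-in; the paper's exact checkpoint may differ
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto")
model.eval()

def option_logprob(question: str, option: str) -> float:
    """Sum of token log-probabilities of `option` given the question text alone."""
    prompt = f"Question: {question}\nAnswer:"
    prompt_ids = tok(prompt, return_tensors="pt").input_ids
    full_ids = tok(prompt + " " + option, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # Score only the answer tokens, i.e. the positions after the prompt.
    answer_len = full_ids.shape[1] - prompt_ids.shape[1]
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)
    target = full_ids[:, 1:]
    token_scores = logprobs.gather(2, target.unsqueeze(-1)).squeeze(-1)
    return token_scores[0, -answer_len:].sum().item()

def blind_answer(question: str, options: list[str]) -> str:
    """Pick the option the language model finds most likely -- no image involved."""
    return max(options, key=lambda o: option_logprob(question, o))
```

Run over a benchmark's test questions, a loop like this yields a blind-model score that can be compared directly against leaderboard results from multimodal systems.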
The mechanism is more revealing than the word "hallucination" suggests. Visual question-answering benchmarks contain textual regularities — the phrasing of questions, the distribution of answers — strong enough to determine correct responses without any image processing. A model trained on millions of image-text pairs learns to recognize the statistical signature of a chest X-ray question and produces the most likely answer. The test does not require vision. It requires having seen similar questions before.
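The strength of those textual regularities can be estimated without a frontier model at all. A rough sketch, assuming a simple (question, answer) data format: fit a bag-of-words classifier on the benchmark's text and see how far above chance it lands. None of this reflects the Stanford team's actual tooling; it only illustrates the diagnostic.

```python
# Illustrative sketch: estimate how much of a VQA benchmark is solvable from
# question text alone. The (question, answer) pair format is an assumption.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.pipeline import make_pipeline

def text_only_baseline(train_qa, test_qa):
    """Fit a bag-of-words classifier on (question, answer) pairs, no images.

    Test accuracy far above chance means the questions carry textual
    regularities strong enough to determine answers without any vision.
    """
    clf = make_pipeline(
        TfidfVectorizer(ngram_range=(1, 2)),       # unigram + bigram question features
        LogisticRegression(max_iter=1000),          # predict the answer from text alone
    )
    clf.fit([q for q, _ in train_qa], [a for _, a in train_qa])
    preds = clf.predict([q for q, _ in test_qa])
    return accuracy_score([a for _, a in test_qa], preds)
```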
The Stanford team — led by Mohammad Asadi, electrical engineering, and Jack W. O'Sullivan, cardiology and biomedical data science, with Fei-Fei Li and Euan Ashley as senior authors — proposes B-Clean as a solution: a benchmark designed to eliminate textual cues that enable non-visual inference. The paper (arXiv:2603.21687) was posted March 23 and revised March 26.
Anthropic's internal flagging of the research before public release is notable. The company has staked much of its reputation on AI safety and has an explicit responsible scaling policy that includes triggering internal review when models reach capability thresholds. Whether this paper triggered any of those thresholds is unknown — Anthropic declined to comment on internal review processes. But the company's willingness to flag the research to government officials suggests it considers the implications serious enough to brief regulators proactively. That is unusual. Lab-initiated briefings on capability risks, rather than post-incident disclosures, are not standard practice across the industry.
The deeper problem is structural. Visual reasoning benchmarks were designed to measure whether models can see. They measure something different: whether models have memorized the textual patterns associated with images. Fixing the benchmarks is tractable — B-Clean is a genuine contribution. But the underlying vulnerability is harder to fix. If a model can answer questions about images it never saw, it can do so in deployment too, generating plausible-sounding observations about X-rays, satellite imagery, or document scans that were never processed. The model does not know it did not see the image. There is no internal signal that fires when the visual pipeline is absent. That is the mirage.