An NIH-backed research team has built a medical AI system designed to show its work — every citation, every evidence chain, every reasoning step — so that doctors and researchers can verify what the system found. The paper describing it was posted to arXiv last Thursday. It has 22 authors, including researchers from the National Library of Medicine; introduces a benchmark of 100 expert-curated questions evaluated by eleven biomedical specialists; and claims the system outperforms existing production tools across multiple criteria. The system is called DeepER-Med.
The problem is the people meant to check it.
Medical AI that synthesizes literature is only as trustworthy as the humans reviewing it. But clinicians are already overwhelmed — and a system that produces more transparent output does not automatically produce output that anyone has time to review. An analysis in the Journal of Medical Internet Research (JMIR) documented that failure mode earlier this year: AI systems used for medical evidence synthesis were generating confident conclusions with unreliable citations and opaque reasoning chains that clinicians could not meaningfully evaluate even when the output was handed to them. DeepER-Med is a direct response to that problem.
That framing is both the paper's selling point and its awkward admission. The arXiv paper cites the JMIR analysis in its introduction as the motivation for building something inspectable. On eight real clinical cases reviewed by practicing clinicians, DeepER-Med's conclusions aligned with the clinical recommendations in seven. In one case, they did not. The paper does not explain why.
Seven out of eight sounds like a passing grade. It is also a precise illustration of the gap between visibility and verifiability. DeepER-Med makes evidence chains visible. It does not make them verifiable at the pace or scale that clinical practice demands. The JMIR researchers noted the core issue: tools that generate more information without a clear mechanism for human review risk compounding the original problem rather than solving it. A system that shows its work is not automatically a system that anyone can act on.
The quantitative results, showing exactly how much DeepER-Med outperforms existing tools and on which criteria, appear in the paper's tables, not in the abstract. No independent researcher has yet replicated or evaluated the system outside the authors' own testing. The one case where the system failed to align with the clinical recommendation is described but not analyzed. Each of those is a fact that belongs in a published article.
The broader pressure point is real and not unique to this paper. Medical AI is developing faster than the infrastructure to evaluate it. Benchmarks exist, but they rarely measure what clinicians actually need in order to trust a system: not just whether it reaches correct conclusions, but whether it can show its work in a way that a time-pressed doctor can verify. DeepER-Med moves in that direction. Whether anyone is positioned to follow through is a different question, and the paper does not answer it.