The robot that learned to do science
When Yuri Tolkach and his team at University Hospital Cologne published their SPARK paper in Nature Medicine last week, the press release described it as an AI tool that could help pathologists diagnose cancer faster. That framing is accurate, and it is also almost entirely beside the point.
SPARK — System of Pathology Agents for Research and Knowledge — is not primarily a diagnostic tool. It is a research pipeline. It takes digitized pathology slides, generates biological hypotheses about what it sees in the tissue, writes executable code to test those hypotheses, runs the tests across thousands of patient samples, and produces validated biomarkers. The system worked without any additional model training. It used OpenAI o1 to generate ideas and Claude Sonnet from Anthropic to turn those ideas into working analysis scripts. Ninety-nine point two percent of what it wrote actually compiled. (Nature Medicine)
Eighteen cohorts. Five cancer types. More than 5,400 patients. The system proposed 618 distinct hypotheses about tumor biology and executed 2,368 parameter tests to validate them. After removing redundant correlations, it landed on 1,115 independent parameters with prognostic value. For microsatellite instability — a biomarker that determines which colon cancer patients respond to immunotherapy — SPARK achieved an AUROC of 0.933 in held-out test cohorts. That is a number that makes oncologists pay attention. (Nature Medicine)
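For readers who want the metric unpacked: AUROC is the area under the receiver operating characteristic curve, where 1.0 means the model ranks every positive patient above every negative one and 0.5 is chance. A minimal Python illustration with scikit-learn, using invented toy labels rather than anything from the paper:

```python
# Toy illustration of held-out AUROC, the metric behind SPARK's 0.933 MSI result.
# The labels and scores below are invented; the paper's number comes from
# evaluating on patient cohorts the system never saw during development.
from sklearn.metrics import roc_auc_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]                    # ground-truth MSI status per patient
y_score = [0.9, 0.2, 0.8, 0.7, 0.3, 0.4, 0.6, 0.1]   # model-predicted probability of MSI

print(roc_auc_score(y_true, y_score))  # 1.0 here: every positive outranks every negative
```

The held-out part is what matters. An AUROC computed on cohorts the system never touched during hypothesis generation is evidence of generalization, not memorization.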
The paper's most striking finding is not a diagnostic benchmark. It is a productivity observation: SPARK generated more complex ideas than the human pathologists it was tested against. When five practicing pathologists submitted their own hypotheses about tumor biology, their concepts were typically single-cell-type descriptions. SPARK's were multi-cell-type spatial hypotheses: interactions among fibroblasts, neutrophils, and macrophages, and the positioning of those cells relative to tumor-stroma boundaries. The machine invented things the humans had not thought to look for. (Nature Medicine)
This is the industrialization pattern. Not AI as a tool that makes scientists faster. AI as a system that takes the bottleneck — the human hypothesis generator — and removes it.
The agentic stack that makes this work is worth examining on its own terms. Most agentic AI papers use a single model for everything. SPARK deliberately separates idea generation from code execution. OpenAI o1 was the idea engine — a reasoning model good at generating hypotheses from context. Claude Sonnet was the coding engine — selected after benchmarking for its ability to turn conceptual descriptions into working Python. The compilation rate served as the quality gate: if the code did not run, the idea was discarded. That is a surprisingly simple feedback loop, and it produced a 99.2 percent success rate. (Nature Medicine)
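To make the shape of that loop concrete, here is a minimal Python sketch. The function names and plumbing are my assumptions, not SPARK's published code: propose_hypothesis and write_analysis_script stand in for calls to the reasoning model and the coding model, and a compile check plays the role of the quality gate the paper describes.

```python
# Minimal sketch of a two-model discovery loop with a compilation quality gate.
# Illustrative only: propose_hypothesis() and write_analysis_script() are
# stand-ins for calls to an idea engine (e.g. o1) and a coding engine
# (e.g. Claude Sonnet); this is not the authors' implementation.
from typing import Callable

def compiles(source: str) -> bool:
    """Quality gate: keep an idea only if its generated script compiles."""
    try:
        compile(source, "<generated>", "exec")  # parse + bytecode-compile, no execution
        return True
    except (SyntaxError, ValueError):
        return False

def discovery_loop(
    propose_hypothesis: Callable[[str], str],     # idea engine: context -> hypothesis
    write_analysis_script: Callable[[str], str],  # coding engine: hypothesis -> Python
    tissue_context: str,
    n_ideas: int = 100,
) -> list[tuple[str, str]]:
    """Generate hypotheses, turn each into code, and keep only what compiles."""
    survivors: list[tuple[str, str]] = []
    for _ in range(n_ideas):
        idea = propose_hypothesis(tissue_context)
        script = write_analysis_script(idea)
        if compiles(script):
            # In SPARK, surviving scripts were then executed across thousands
            # of patient samples for statistical validation.
            survivors.append((idea, script))
    return survivors
```

The appeal of this gate is that it is cheap and model-agnostic: no human review, no second LLM judging quality, just the compiler saying yes or no. The reported 99.2 percent pass rate suggests the coding model rarely stumbled at this hurdle; the heavier filtering happened downstream, in the statistical validation across cohorts.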
Whether this generalizes beyond Cologne is the right question. The paper validates on retrospective cohorts — strong evidence, but not the same as a clinical deployment. There is no regulatory filing, no FDA precertification pathway, no commercial product yet. The authors describe SPARK as a companion algorithm for drug development and patient selection, not a clinical device. That gap between research validation and clinical deployment is real, and it is large.
It is also, arguably, the least interesting thing about the paper.
The more important question is what it means that this pipeline exists at all. If a general agentic architecture can autonomously generate, code, and validate hypotheses from standard histopathology images, the economics of biomarker discovery change. Today, developing a new tissue biomarker takes years and costs millions — pathologists propose hypotheses, engineers build extraction algorithms, statisticians validate across cohorts. SPARK collapsed that loop. An academic lab with access to foundation model APIs and a computational pathology setup can now generate hundreds of validated hypotheses in a single run.
The companies with the most to answer are the ones that built their businesses on the old economics. Paige, Ibex, PathAI — the diagnostic AI companies that raised on the premise that AI could read pathology slides as well as or better than humans — are all, at their core, task-specific systems. They were trained to answer one question: is this cancer or not? SPARK answers a different question: what else is in here that we have not thought to look for? A system that can generate novel biomarkers from routine H&E slides is not competing on the same axis as a system that classifies known categories. It is a platform competing with point solutions.
Tempus AI acquired Paige in August 2025 for $81.25 million, a fraction of what diagnostic AI companies historically raised. A price like that invites the question this paper sharpens: can the incumbent business model survive a generalist architecture that generates biomarkers from scratch rather than classifying preset categories?
There is a credentialing problem buried in this too, one that the paper raises but does not answer. When SPARK outperforms human pathologists at generating hypotheses about tumor biology, what does that mean for the humans who were trained to do exactly that work? The paper's human participants — five pathologists and a medical student — each produced prognostically significant ideas. But they produced simpler ones, and the system dwarfed them in volume and complexity. The authors describe the human contribution as valuable for targeted exploration, not general hypothesis generation. If that framing holds, the role of the expert in biomarker discovery shifts: not the source of ideas, but the curator and validator of machine-generated ones. That is a meaningful change in what expertise means in a research context. (Nature Medicine)
The paper does not discuss deployment, regulatory pathways, or commercialization. Corresponding author Tolkach runs a computational pathology lab at University Hospital Cologne (TolK Lab); co-author Fabian Eling is at the Technical University of Munich; Sajjad Sahin is listed with affiliations in Germany. The team has a track record in computational pathology — prior work includes GrandQC for tissue quality control and PATQUANT for lung cancer grading — but SPARK is a research system, not a product. What happens next depends on whether the framework gets open-sourced, whether a commercial partner picks it up, or whether it stays in the academic pipeline where the Cologne lab continues to improve it.
The honest assessment: this is a real paper with real numbers, published in a real journal, with a real architectural insight — that separating idea generation from code execution in an agentic pipeline produces a reliable quality gate. It is not vaporware. It is also not a product, and the gap between what SPARK demonstrated and what a clinical AI system requires should not be hand-waved away.
What it is, correctly understood, is an existence proof. Proof that the bottleneck in scientific discovery — the human who has to think of what to test — can be automated. Proof that a general agentic architecture can generate hypotheses from images, write code to test them, and validate the results across thousands of patients without task-specific training. Whether that generalizes to other domains, other data types, and other biological questions is the next five years of research. SPARK is the first published answer to that question, and it is an affirmative one.
The paper is Trost, F. et al. "An agentic framework for autonomous scientific discovery in cancer pathology." Nature Medicine (2026). DOI: 10.1038/s41591-026-04357-y.