12 Hours, 42,000 Lines of Code, Zero Fatigue
Can AI Agents Automate Scientific Discovery?

The AI scientist Kosmos runs for up to twelve hours, executing 42,000 lines of code and reading 1,500 papers before producing a scientific report. It operates without fatigue, without lunch breaks, and without the confirmation bias that plagues human researchers. The question is whether that is a feature or a bug.

Three autonomous AI agent systems for life science research were on display at NVIDIA GTC 2026 in San Jose. Kosmos, from startup Edison Scientific, made seven discoveries in a recent technical run: three reproduced findings from preprints or unpublished manuscripts, and four made novel contributions to the scientific literature. Latent-Y, from Latent Labs, designed therapeutic nanobody candidates against nine targets and produced lab-confirmed binders for six of them, at single-digit nanomolar affinities, without human filtering. LabOS, developed by researchers at Stanford and Princeton, extended an earlier agent called CRISPR-GPT from text-based reasoning into a physical laboratory system that connects AI agents to smart glasses and robotic lab equipment.

Edison Scientific is the commercial spinout of FutureHouse, a non-profit AI scientist initiative backed by former Google CEO Eric Schmidt and co-founded by Sam Rodriques, formerly a group leader at the Francis Crick Institute. Latent Labs is led by CEO Simon Kohl, who describes his system as a force multiplier that compresses weeks of work into hours; the company's technical report cites a 56-fold speedup. Both claims deserve scrutiny.

The numbers are real but narrow. Kosmos achieved 79.4 percent accuracy according to independent scientist evaluation of its generated reports, meaning roughly one in five statements it produced was wrong. Latent-Y succeeded on six of nine targets, a 67 percent hit rate, but nine targets is a small sample.
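
How small is a nine-target sample? As a rough illustration, not something taken from the Latent Labs report, the sketch below computes a 95 percent Wilson score interval for six successes in nine trials. The function name `wilson_interval` and the choice of the Wilson method are assumptions made here for illustration, and the calculation treats the nine targets as independent trials, which may not hold in practice.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Return the Wilson score interval for a binomial proportion.

    z = 1.96 corresponds to a two-sided 95 percent confidence level.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")

    p_hat = successes / trials
    z2 = z * z
    denom = 1 + z2 / trials
    # Center of the interval is pulled toward 0.5 for small samples.
    center = (p_hat + z2 / (2 * trials)) / denom
    margin = z * math.sqrt(p_hat * (1 - p_hat) / trials + z2 / (4 * trials**2)) / denom
    return center - margin, center + margin


# Illustrative input from the article: 6 lab-confirmed binders out of 9 targets.
low, high = wilson_interval(6, 9)
print(f"Observed hit rate: {6 / 9:.0%}")
print(f"95% Wilson interval: {low:.0%} to {high:.0%}")
```

Under these assumptions the interval runs from roughly 35 to 88 percent, so the result is compatible with a true success rate anywhere from about one in three to nearly nine in ten.
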
Both systems were evaluated under conditions their own developers designed, which is standard in the field but not the same as an independent clinical validation.

What makes these systems interesting is not their absolute performance but the context into which they arrive. A Nature survey, corroborated by a follow-up study in PLOS Biology, found that 70 percent of biomedical scientists cannot reproduce experiments from colleagues and 50 percent cannot reproduce their own work after a few months. These are not edge cases. They represent the baseline reality of how most life science research is conducted and published.

Marinka Zitnik, an associate professor of biomedical informatics at Harvard Medical School, pointed out at GTC that 95 percent of all life sciences publications focus on just 5,000 of the most well-studied human genes. An AI agent that reads the existing literature will reproduce the biases of that literature, optimizing for the same pathways and proteins that already receive the most attention. Whether these agents inherit the field's blind spots or begin to correct them remains an open question.

Rory Kelleher, NVIDIA's senior director for global healthcare and life sciences, framed the stakes plainly at GTC: "It is not that AI is going to replace scientists, but perhaps the scientists who use AI are going to phase out the ones who don't." That is probably true. It is also not the same as what the technology optimists imply. Automation does not automatically improve the quality of what it automates.

The task-length doubling rate matters here. METR, an organization that publishes open-source agent benchmarks, found that the length of tasks AI agents can reliably complete is doubling every seven months. That finding is self-reported, drawn from METR's own blog and technical report, and is not a peer-reviewed result. That rate of improvement is why the companies at GTC felt confident announcing what they did.
Whether the confidence is earned is a separate question.

The most honest version of this story is narrower than the press releases suggest. Kosmos, Latent-Y, and LabOS represent genuine progress in applying autonomous agents to experimental science. They can run more experiments, in parallel, faster than a human lab bench can manage. For routine tasks like literature synthesis and candidate ranking, that matters. But the reproducibility crisis is not a tooling problem. It is an incentive structure problem: scientists are rewarded for publishing novel results, not for verifying old ones. Faster publication does not fix that. It may make the problem worse.