OpenAI built a 750-task benchmark for AI drug-discovery research. Its best model scored 36%, and the same company designed the test, set the rubric, and published the results.
LifeSciBench is OpenAI's new benchmark for the kind of open-ended work bench scientists actually do: evidence handling, experimental design, analysis, validation, scientific reasoning, and translation or communication, across seven biological domains. Tasks are backed by 1,062 supporting artifacts (figures, PDFs, tables, genomic sequences, molecular structures, chemical files, web references) and were written by 173 scientists with PhD-level training and biotech or pharma industry experience. Lab Critics' third-party breakdown reports 19,020 individual rubric criteria across the suite and 453 expert reviewers, 97% of whom hold PhDs, with an average of 12 years of experience and 14 publications each. Task acceptance was gated on a 90% reviewer agreement threshold, and each task went through an average of six automated review cycles and at least two expert review rounds.
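To put the suite's size in per-task terms, the arithmetic below divides the quoted totals by the task count. It is a rough sketch that assumes the 19,020 criteria reported by Lab Critics and the 1,062 artifacts both cover the same 750 tasks; the variable names are illustrative.

```python
# Totals as quoted above; assumed to describe the same 750-task suite.
TOTAL_TASKS = 750
RUBRIC_CRITERIA = 19_020
SUPPORTING_ARTIFACTS = 1_062

# Average rubric criteria and supporting artifacts per task.
criteria_per_task = RUBRIC_CRITERIA / TOTAL_TASKS
artifacts_per_task = SUPPORTING_ARTIFACTS / TOTAL_TASKS

print(f"Rubric criteria per task: {criteria_per_task:.1f}")
print(f"Supporting artifacts per task: {artifacts_per_task:.2f}")
```

On those assumptions, each task is graded against roughly 25 criteria on average, which is where the rigor of the rubric shows up.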
The methodology is unusually rigorous. The incentive structure runs in the other direction. OpenAI contracted the task authors, distributes the evaluated model (branded as GPT-Rosalind, likely a nod to crystallographer Rosalind Franklin) through a trusted-access research preview, and keeps independent reproduction gated behind that same preview. Claude models were not included in OpenAI's reported comparison, even though Lab Critics' per-task analysis suggests they are competitive on similar work. In the third-party scoring reported there, GPT-Rosalind led 386 of 750 tasks, Gemini 3.1 Pro led 214, and Grok 4.3 was also evaluated. As a result, the canonical public comparison is the one OpenAI chose to publish.
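The task-lead counts are easier to read as shares of the suite. The sketch below works only from the two counts quoted above; the source does not say how the remaining tasks were split, so they are reported as unattributed, not assigned to Grok 4.3 or any other model.

```python
# Task-lead counts as quoted above from Lab Critics' third-party scoring.
TOTAL_TASKS = 750
leads = {"GPT-Rosalind": 386, "Gemini 3.1 Pro": 214}


def lead_shares(leads, total):
    """Return each model's share of tasks led, as a percentage of total."""
    return {model: round(100 * n / total, 1) for model, n in leads.items()}


shares = lead_shares(leads, TOTAL_TASKS)

# Tasks not attributed to either model in the quoted figures.
# The source gives no breakdown for these, so none is assumed here.
unattributed = TOTAL_TASKS - sum(leads.values())

for model, pct in shares.items():
    print(f"{model}: led {pct}% of tasks")
print(f"Unattributed in the quoted figures: {unattributed} tasks")
```

GPT-Rosalind's lead covers just over half the suite, and a fifth of the tasks are not accounted for by the two counts given.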
The headline number is honest in a way benchmark results usually are not. OpenAI's release explicitly cautions that LifeSciBench measures self-contained research tasks and does not establish whether AI accelerates drug discovery or improves research outcomes. That caveat, written by OpenAI itself, is the most useful sentence in the package. A model that scores 36% on self-contained research tasks has not yet shown it can move a real lab forward.
The harder pattern is what the models got wrong. EdTech Innovation Hub and OpenAI's own data both flag that performance dropped further on tasks requiring experimental design, exact calculation, and interpretation of external artifacts. Those are the steps a bench scientist spends the most time on, and the ones a useful research assistant would need to handle. A model that scores well on lookup and poorly on the steps that consume a scientist's time has handled one slice of the work, not the work.
StartupHub.ai and DistilInfo treat LifeSciBench as a launch announcement, with the 36% score as the headline. The analytical question is whether the field accepts OpenAI's rubric as the canonical definition of "AI doing real science," or whether independent labs build a comparable benchmark with a wider model pool, full reproduction rights, and an explicit test in real-world research environments.
OpenAI says the next step is evaluating how models perform inside live research workflows, not just on self-contained tasks. That is the right test. It is not the test LifeSciBench currently runs.