OpenAI's new test for AI scientists: 750 tasks, 173 real researchers, no multiple choice

OpenAI on Wednesday released a new test, LifeSciBench, designed to measure whether AI systems can handle the judgment work of life science research: not biology trivia but the messy parts, such as interpreting incomplete evidence, designing experiments, and deciding what to do next under uncertainty. The benchmark draws on 750 tasks written by 173 Ph.D.-level scientists and graded against 19,020 expert-written rubric criteria, according to OpenAI's announcement of the release.

What OpenAI is trying to escape is a generation of life-science AI tests that look rigorous on a leaderboard and break the moment a real researcher opens the result. Most prior evals, OpenAI argues, skew toward single domains, isolated skills, and clean reference answers. LifeSciBench is built around seven research workflows (evidence handling, analysis, design and optimization, scientific reasoning, validation and operations, translation, and scientific communication) and seven biological domains. The tasks come with attachments: figures, PDFs, tables, sequence files, structure and chemical files, and web references. OpenAI says 1,062 such artifacts are folded into the set, and 53% of tasks require interpreting or synthesizing at least one of them. Seventy-nine percent require multiple reasoning steps, with an average of four per task.

One example task surfaces what "judgment" means here in practice. The model is asked to critique a regulatory package for a Duchenne muscular dystrophy micro-dystrophin therapy being considered for an accelerated FDA pathway, and to flag the assay-specific failure modes a reviewer would worry about: a Western blot whose antibody might be cross-reacting with the wrong epitope, an immunofluorescence stain whose antibody binds the wrong end of the protein, the contested validity of the surrogate endpoint the sponsor is leaning on, and the design of the muscle biopsy that is supposed to demonstrate efficacy. The point is not whether the model gets the right answer. The point is whether it reasons about the experiment the way a translational scientist would. Grading, per OpenAI, runs against the 19,020 rubric criteria, an average of about 25 per task, written to evaluate scientific correctness, justification, caveats, and operational usefulness, not just the final line.
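
To make the grading idea concrete, here is a minimal sketch of how a rubric-graded score could be computed for a single response. It is an illustration only, not OpenAI's grader: the `Criterion` class, the `weight` field, the `rubric_score` function, and the four example criteria (loosely paraphrased from the Duchenne task described above) are all assumptions, and the announcement does not say whether criteria are weighted or how they are aggregated.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    description: str
    weight: float  # hypothetical: the announcement does not describe weighting
    met: bool      # a grader's yes/no judgment for one model response


def rubric_score(criteria):
    """Return the weighted fraction of rubric criteria a response satisfied."""
    total = sum(c.weight for c in criteria)
    if total == 0:
        raise ValueError("rubric has no weight to score against")
    earned = sum(c.weight for c in criteria if c.met)
    return earned / total


# Arithmetic check on the published figures: 19,020 criteria across 750 tasks.
criteria_per_task = 19_020 / 750  # 25.36, reported as an average of 25

# Made-up grading of one response, with equal weights assumed.
example = [
    Criterion("Flags possible antibody cross-reactivity in the Western blot", 1.0, True),
    Criterion("Notes the immunofluorescence antibody's binding site", 1.0, True),
    Criterion("Questions the validity of the surrogate endpoint", 1.0, True),
    Criterion("Critiques the design of the muscle biopsy", 1.0, False),
]
score = rubric_score(example)  # 3 of 4 equally weighted criteria met: 0.75
```

A response that hits three of four equally weighted criteria scores 0.75 under this scheme, regardless of how polished it sounds, which is why the quality of the criteria themselves matters so much.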

The benchmark's construction is the most serious part of the release, and also the part that is hardest for an outsider to verify from the announcement page alone. OpenAI says the 173 Ph.D.-level contributors all had biotech or pharma industry experience; 453 expert reviewers ran at least two rounds of review per accepted task; the inter-reviewer agreement threshold was set at 90%; and accepted tasks averaged six self-directed review cycles before they cleared. None of those numbers have been independently audited in the materials OpenAI published, and the company has not yet released a public leaderboard showing how specific models score on the set. The post links a paper, but the paper's methodology section, the part that would let an outside benchmark researcher poke at the rubric design and the inter-rater reliability, was not part of what OpenAI put on the launch page.

That gap matters, because vendor-authored benchmarks have a track record of measuring what their authors can grade rather than what matters most. Three structural questions will follow LifeSciBench into the field. First, does rubric-graded free response reward answers that sound like a scientist rather than answers that would actually advance a project? The Duchenne example suggests the designers are alert to that risk, but a rubric is only as good as the criteria it is built from. Second, how much does the contributor pool narrow the test? It is drawn from biotech and pharma drug discovery, which shapes what counts as "real research" inside the benchmark; basic-bench biology, academic lab work outside translation, ecology, plant science, and the long tail of fields where AI assistants are also being tested are more thinly represented. Third, how far does a paper score carry? The tasks are written scenarios, not live wet-lab work, so a strong LifeSciBench score tells a scientist something about how an AI reasons through a research problem on paper, and nothing about whether it can actually run the experiment, troubleshoot the assay, or survive a contamination event at 11 p.m. on a Tuesday.

What the benchmark is genuinely useful for is standardized visibility. A working scientist can use a LifeSciBench-style score to compare two models on the same translational critique, or to flag a model's weakness in assay design before trusting it on a real protocol. What it cannot do is settle the bigger question the field keeps asking, which is whether any AI system can do science. The interesting watch item is whether OpenAI, or an outside group using the same task set, publishes per-model scores with confidence intervals and a description of which rubric criteria most often fail, and whether independent benchmark researchers get access to the underlying rubric file. Until then, LifeSciBench is best read as a serious attempt to widen the question AI-in-science benchmarks ask, and a reminder that the people who fund a test are usually the people most invested in it looking good.
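
For readers wondering what "per-model scores with confidence intervals" would involve, one common approach is a percentile bootstrap over per-task scores. The sketch below is an assumption about how an outside group might report results, not a method described in OpenAI's materials; the function name `bootstrap_ci`, its parameters, and the example scores are all hypothetical.

```python
import random


def bootstrap_ci(task_scores, n_resamples=2000, alpha=0.05, seed=0):
    """Return (mean, lower, upper) using a percentile bootstrap over tasks."""
    if not task_scores:
        raise ValueError("need at least one task score")
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(task_scores)
    means = []
    for _ in range(n_resamples):
        # Resample tasks with replacement and record the resampled mean.
        sample = [task_scores[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lower_idx = max(0, int((alpha / 2) * n_resamples))
    upper_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples) - 1)
    return sum(task_scores) / n, means[lower_idx], means[upper_idx]


# Made-up per-task rubric scores for one hypothetical model (not real results).
scores = [0.62, 0.71, 0.55, 0.80, 0.47, 0.66, 0.74, 0.59, 0.68, 0.52]
mean, lower, upper = bootstrap_ci(scores)
```

Two models whose intervals overlap heavily on the same task set should not be ranked on the point estimate alone, which is the practical reason to ask for intervals rather than a single leaderboard number.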
