Even OpenAI's best model solves fewer than 3 in 10 real science tasks end-to-end.

The empirical ceiling arrived on a single Tuesday in late June 2026. OpenAI's GeneBench-Pro, a new 129-task end-to-end benchmark that grades real scientific work across ten areas from genomics to translational medicine, found that even GPT-5.6 Sol, OpenAI's most capable model, completes only 28.7% of those tasks at maximum reasoning effort and 31.5% in its slower Pro mode. The next-best non-OpenAI system, Claude Opus 4.8, reaches just 16.0%. For the first time, a frontier lab has published numbers showing that "better models" alone will not clear the bar for actual scientific research.

The same day, OpenAI's chief rival reached the same conclusion by a different route. Anthropic released Claude Science, an AI workbench for scientists that orchestrates existing models rather than shipping a new one. It plugs into more than 60 scientific databases, pre-installs domain toolkits for genomics, protein structure, and chemistry, and ships alongside a $30,000 credit grant program for 50 postdoc and graduate-student projects, with applications closing on 2026-07-15. The signal is not a better Claude. The signal is that Anthropic believes the binding constraint on AI for life-science research is no longer the underlying model.

A third major player has been making the same bet from a different angle. Google DeepMind's Gemini for Science wraps the company's proprietary scientific assets, including AlphaFold for protein structure and AlphaGenome for genomic regulation, into a platform built around 30 or more life-sciences databases. Where Anthropic is selling a workbench and OpenAI is selling a benchmark plus a closed model behind a gate, Google is selling the platform itself, with its own scientific models as the moat.

OpenAI's GeneBench-Pro paper names the failure pattern directly: the "notice-act gap." Today's models can recognize anomalies and detect local diagnostic signals in scientific data, but they cannot translate that perception into the next methodological step. They spot a problem. They do not fix it. That gap is the reason none of the three strategies above can rely on the model to finish the job on its own.

This is why the model race has bifurcated into an ecosystem race. The strategic moves on and around June 30 are not random product launches. They are three incompatible responses to the same measurable wall. Anthropic is going wide: subscription distribution plus a workbench that calls external vertical models through the Model Context Protocol. OpenAI is going narrow: an open benchmark as the standard, a closed life-sciences-tuned model behind trusted-access review, and an enterprise gate that controls who gets to use it. Google is going deep: proprietary scientific assets become the platform itself. The three bets are not converging.

The critique underneath those bets is older than the products. Luomi Tech CEO Wu Hao, in trade-press analysis of the same-day launches, names three structural deficits that today's language models cannot easily escape in life-science work: they do not natively model the structure of raw biological data, they cannot tokenize stochastic phenomena like gene expression, and they cannot reason cleanly across the missing values that saturate real datasets. These are not prompt-engineering problems. They are reasons the notice-act gap will not close by scaling alone.

The economics of closing the gap from the human side are also sobering. GeneBench-Pro estimates that having a human expert grade a single question can cost thousands of dollars: roughly $200 per hour across 20 to 40 hours of expert review, or about $4,000 to $8,000 per question. That makes expert scientific labor the binding constraint when models are unreliable. Whoever moves the model past the notice-act gap, or builds an ecosystem that routes around it, will determine whether AI for science becomes a working research tool or stays a demo layer over spreadsheets.
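To put the grading figures in perspective, here is a back-of-the-envelope sketch in Python. The hourly rate, the hour range, and the 129-task count come from the numbers quoted above. The helper name `expert_grading_cost` is hypothetical, and the assumption that every task gets exactly one expert pass is ours, not something the benchmark paper is known to state.

```python
# Figures quoted in the article; everything else is an illustrative assumption.
HOURLY_RATE_USD = 200          # approximate expert rate
REVIEW_HOURS_RANGE = (20, 40)  # hours of expert review per question
TASK_COUNT = 129               # GeneBench-Pro task count


def expert_grading_cost(hourly_rate, hours_range, tasks=1):
    """Return (low, high) cost in USD for grading `tasks` questions once each.

    Hypothetical helper: assumes a single expert pass per question.
    """
    low_hours, high_hours = hours_range
    return (hourly_rate * low_hours * tasks, hourly_rate * high_hours * tasks)


per_question = expert_grading_cost(HOURLY_RATE_USD, REVIEW_HOURS_RANGE)
full_benchmark = expert_grading_cost(HOURLY_RATE_USD, REVIEW_HOURS_RANGE, TASK_COUNT)

print(f"Per question: ${per_question[0]:,} to ${per_question[1]:,}")
print(f"All {TASK_COUNT} tasks, one pass: ${full_benchmark[0]:,} to ${full_benchmark[1]:,}")
```

Under those assumptions, a single expert pass over the whole benchmark would run from roughly half a million to about one million dollars, which is why human review does not scale as a substitute for reliable models.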

The June 30 moves also surface what is not yet decided. GPT-Rosalind, OpenAI's life-sciences-tuned model, shipped in April 2026 and counts Amgen, Moderna, the Allen Institute, and Thermo Fisher among its deployments, with a Life Sciences Research Plugin bundling more than 50 tools and databases. The customers are real. How deep those deployments actually run in production is not publicly known, and access is gated behind OpenAI's safety review. A later capability update folded in stronger life-sciences intelligence through the LifeSciBench workflow, but the underlying access gate did not move. Anthropic's grant program is open until 2026-07-15 and may pull early-career scientists into Claude Science as a default workbench before any competing tool wins a foothold. The database count and bundling claims for Google's platform rest on trade-press analysis rather than a fresh primary press release.

What to watch next: any independent third-party reproduction of the GeneBench-Pro pass rates, including the 28.7% headline figure and the scores reported for non-OpenAI models, since GeneBench-Pro is OpenAI's own benchmark; whether Anthropic's workbench lets non-Anthropic models act as first-class tools or quietly routes everything through Claude; and whether DeepMind's Gemini for Science opens its AlphaFold-class assets to outside researchers or keeps them inside the walled garden. The model race did not end. It just stopped being the race that matters most.