Anthropic built a benchmark it can win. The hard problems tell a different story.
Anthropic published a benchmark this week that it says shows Claude solving problems human experts could not. The number being quoted is 30 percent. The number not being quoted much is 44 percent.
BioMysteryBench, described in a post by researcher Brianna on Anthropic's website Tuesday, tasks Claude with 99 computational biology problems drawn from real bioinformatics datasets. Seventy-six of those problems were solvable by at least one of up to five domain experts who attempted them. Twenty-three were not. On those 23 hard problems, Claude Mythos Preview, Anthropic's unreleased frontier model, achieved roughly a 30 percent solve rate.
The headline wrote itself. What the blog post also contains, buried in the reliability analysis, is the finding that on the hardest tier, 44 percent of the problems the model solved were solved in only one or two of five attempts. On the human-solvable problems, that rate was 9 percent. Anthropic frames this as the model sometimes stumbling onto a solution rather than following a reproducible strategy. That is the company's own language.
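The metric itself is simple to state. Below is a minimal sketch of the computation, assuming each problem received five independent attempts and that a win means at least one correct attempt; the data layout is illustrative, not Anthropic's:

```python
def brittle_win_fraction(results: list[list[bool]]) -> float:
    """Fraction of solved problems whose wins rest on only one or two
    of five correct attempts, i.e. wins that may not survive a rerun."""
    wins = [r for r in results if any(r)]
    brittle = [r for r in wins if sum(r) <= 2]
    return len(brittle) / len(wins) if wins else 0.0

# Toy data: one robust win (4/5), two brittle wins (1/5 each), and one
# unsolved problem that the metric excludes entirely.
attempts = [
    [True, True, True, True, False],
    [True, False, False, False, False],
    [False, True, False, False, False],
    [False, False, False, False, False],
]
print(brittle_win_fraction(attempts))  # ~0.67
```

The point of the metric is that the headline solve rate and the reliability of any given solve are different quantities, and the post reports both.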
The methodology is the part worth paying attention to. BioMysteryBench derives ground truth not from expert conclusions but from controllable properties of the data itself or orthogonally validated metadata. A question like which organism a crystal structure belongs to has an objective answer, verifiable independently of whether any scientist knew it. This design choice matters because it allows the benchmark to generate problems humans cannot solve, which is the only way to test whether a model has genuine superhuman scientific reasoning rather than just matching the approaches human scientists happened to try.
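To make that design concrete: a grader built this way checks the model's answer against metadata recorded independently of any solver. A minimal sketch, with illustrative field names rather than the benchmark's actual notebook format:

```python
def grade_organism_answer(model_answer: str, deposition_record: dict) -> bool:
    """True if the model's organism claim matches what was recorded when
    the structure was deposited, independent of any human's analysis."""
    truth = deposition_record["source_organism"].strip().lower()
    return model_answer.strip().lower() == truth

# Hypothetical deposition record; real ground truth would come from an
# external archive such as the PDB, not from an expert's conclusion.
record = {"structure_id": "XXXX", "source_organism": "Thermus thermophilus"}
print(grade_organism_answer("Thermus thermophilus", record))  # True
print(grade_organism_answer("Escherichia coli", record))      # False
```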
The approach has independent backing. According to The Decoder, Genentech and Roche developed CompBioBench concurrently and reached comparable conclusions about frontier-model capability in computational biology without coordinating with Anthropic, which is the closest thing to replication these benchmarks ever get.
The timing is not neutral. The same week Anthropic published BioMysteryBench, the New York Times published transcripts of frontier chatbots, including Claude, walking biosecurity experts through pathogen synthesis and dispersal. Stanford microbiologist David Relman and MIT geneticist Kevin Esvelt, among others, assessed the outputs as remarkably creative and realistic. Britain's Centre for Long-Term Resilience likened existing safeguards to a flimsy wooden fence. Anthropic's safety lead told the Times there is an enormous difference between a model producing plausible-sounding text and giving someone what they need to act. The company has set aggressive refusal thresholds for biological prompts. Those are reasonable positions. But BioMysteryBench is simultaneously an argument that Claude has genuine scientific reasoning capability in biology, not just plausible-sounding text. The reliability data complicates both claims at once.
What the 30 percent solve rate means in practice depends on what you think a benchmark is for. If it is a ceiling indicator, showing what is theoretically possible, 30 percent on genuinely hard problems with a novel methodology is notable. If it is a reliability claim, a 44 percent brittle-win rate on the hardest tier means a researcher who handed this system a hard biology problem would see it solved at all only about one time in three, and nearly half of those wins came from runs that might not reproduce.
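The gap between those two readings can be put in numbers with the standard pass@k estimator from Chen et al. (2021), which converts c correct attempts out of n into the probability that a single fresh attempt succeeds. This is the common scoring method for repeated sampling, not necessarily the one Anthropic used:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased probability that at least one of k fresh samples is
    correct, given c correct answers observed across n attempts."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# A brittle win (1 of 5 correct) implies roughly a 20 percent chance on
# any single try; a robust win (4 of 5) implies roughly 80 percent.
print(pass_at_k(n=5, c=1, k=1))  # 0.2
print(pass_at_k(n=5, c=4, k=1))  # 0.8
```

On that arithmetic, a brittle win is a result that comes back around one rerun in five.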
Drug discovery pipelines require reproducibility. A model that solves a problem once in five attempts is not yet a collaborator in the sense a pharmaceutical lab would use the word. It is closer to a very well-read oracle that occasionally says something nobody else could have thought of, and more often says something that sounds right but is wrong.
The biosecurity question compounds the capability question. If frontier AI can solve hard biology problems at all, even unreliably, the dual-use risk that safety institutes have spent years trying to quantify is not a theoretical construct. It is a present fact. The 44 percent brittle-win rate does not reduce that risk. It describes a model that sometimes succeeds in ways its operators cannot explain or reproduce.
Anthropic made two arguments in the same week. One is that Claude has genuine scientific reasoning capability in computational biology. The other is that the safeguards on biological misuse are robust. The benchmark data speaks to the first argument but does not fully vindicate the second.
BioMysteryBench is available on Hugging Face, and the dataset includes validation notebooks for each question. Independent researchers have not yet examined the 23 hard problems, or what they actually asked, in any depth. That examination is the natural next step for anyone trying to assess whether the 30 percent figure represents a genuine capability milestone or an interesting ceiling that will take years to raise.
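For anyone inclined to start, a loading sketch; the repository id, split name, and tier field below are placeholders, since the post names the platform but the exact path and schema are not quoted here:

```python
from datasets import load_dataset

# Hypothetical repository id; substitute the real one from Anthropic's
# Hugging Face organization.
bench = load_dataset("anthropic/BioMysteryBench")

# Isolate the 23 expert-unsolved problems, assuming a tier label exists
# in the schema; both the split and the field name are guesses.
hard = bench["train"].filter(lambda ex: ex["tier"] == "expert_unsolved")
print(len(hard))
```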