The most newsworthy claim in Genesis Molecular AI's paper on its new protein-ligand cofolding model isn't that it beats AlphaFold 3 on a benchmark. It's the accusation, from the company's own leadership, that the benchmark it won is broken.
In a Latent Space podcast this week, Genesis co-founder Evan Feinberg and CTO Sergey Edunov argued that the community benchmarks for molecular structure prediction are essentially calling AI slop good enough. That line, dropped in the same breath as the model release, is the more durable story, and it cuts against the product pitch Genesis is also selling.
Pearl, short for "Placing Every Atom in the Right Location," is a diffusion-based foundation model for predicting the 3D shape a small drug-like molecule will adopt when it docks with a protein. That docking pose is one of the central questions in early-stage drug discovery: get the geometry right and a medicinal chemist has somewhere to start. Get it wrong and everything downstream, from synthesis to assays to animal work, is wasted.
Genesis's claim, per the company's announcement and the arXiv paper it published with NVIDIA researchers, is that Pearl is the first model to clearly surpass AlphaFold 3 on protein-ligand complex prediction, by modeling the protein's flexibility rather than treating the binding pocket as rigid. That combination, a win over AlphaFold 3 plus NVIDIA coauthorship, is the headline most outlets will write. It's also the headline Genesis wants.
What's more interesting is that Feinberg and Edunov, in the same conversation, argue the benchmark Pearl just won doesn't measure what the field claims it measures. The implicit argument: if the leaderboard rewards models for being "good enough" on easy cases, then topping the leaderboard clears a low bar, not a high one. The substantive test is whether a model can predict poses reliable enough to drive real discovery workflows, including synthesis decisions, candidate ranking, and prioritization for experimental testing.
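The arithmetic behind that argument is easy to sketch. The Python below is a minimal illustration, not Genesis's or any benchmark's actual scoring code: the per-target numbers are invented, the "easy" and "hard" labels are hypothetical, and the 2 Å RMSD cutoff is assumed as the conventional pose-success threshold. It shows how a headline success rate can look strong while the hard cases, the ones that matter to a discovery team, are a coin flip.

```python
from collections import defaultdict

# Hypothetical per-target results: (difficulty bucket, ligand RMSD in angstroms).
# These values are invented for illustration; they come from no real benchmark.
RESULTS = [
    ("easy", 0.8), ("easy", 1.1), ("easy", 1.4), ("easy", 0.9),
    ("easy", 1.7), ("easy", 1.2), ("easy", 1.9), ("easy", 1.0),
    ("hard", 3.5), ("hard", 1.6),
]

RMSD_CUTOFF = 2.0  # assumed pose-success threshold, in angstroms


def success_rate(rmsds, cutoff=RMSD_CUTOFF):
    """Fraction of predicted poses whose RMSD falls under the cutoff."""
    rmsds = list(rmsds)
    return sum(r < cutoff for r in rmsds) / len(rmsds)


def stratified_success(results, cutoff=RMSD_CUTOFF):
    """Success rate computed separately for each difficulty bucket."""
    buckets = defaultdict(list)
    for bucket, rmsd in results:
        buckets[bucket].append(rmsd)
    return {b: success_rate(v, cutoff) for b, v in buckets.items()}


overall = success_rate(r for _, r in RESULTS)
by_bucket = stratified_success(RESULTS)

print(f"overall: {overall:.0%}")
for bucket, rate in by_bucket.items():
    print(f"{bucket}: {rate:.0%}")
```

With these made-up inputs the aggregate is dominated by the eight easy targets, so the overall figure says little about the two hard ones. Reporting the stratified numbers alongside the headline is one way a benchmark could make that gap visible.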
That critique echoes what chemists outside the vendor lane have been saying. Pat Walters, a veteran computational chemist who writes the Practical Cheminformatics newsletter, has been tracking the cofolding subfield and documenting the gap between benchmark performance and practical utility. Community benchmarks like the OpenBind Consortium's EV-A71 2A set are part of the effort to fix that, by publishing harder, more realistic test cases. But OpenBind is small, and Pearl's headline number still comes from the older, easier benchmark that Genesis's own executives are now criticizing.
The Llama angle is real: per the Latent Space interview, Edunov led Llama 2 training and Llama 3 pretraining at Meta before joining Genesis. But it's also exactly the kind of pedigree claim that does the work of validation without providing it. The relevant question isn't whether Edunov can train a diffusion model. He can. The question is whether the field's evaluation methodology can tell good structure prediction from bad.
So what would have to surface for Pearl, or any of its competitors, to be treated as more than a vendor claim? Three things, roughly: a peer-reviewed third-party evaluation on the harder benchmarks (OpenBind and its successors); a named pharma partner willing to disclose retrospective or prospective results; or wet-lab confirmation that compounds chosen from Pearl's top-ranked poses can actually be synthesized, bind, and behave the way the model predicts. None of those has appeared in the public record for Pearl as of this week.
Genesis is betting that the next generation of drug-discovery models will be built on diffusion primitives rather than the autoregressive architecture that dominates language modeling. That's a defensible technical bet, and the Pearl paper makes a real contribution to the literature on architectural choices for 3D molecular generation. Whether the field's benchmarks can yet show that the contribution matters in a lab is a separate, and still open, question.