Heather Kulik has spent years trying to teach AI to understand chemistry. The result is a precise map of where AI works in her field, and where it breaks down.
Kulik, a materials science researcher at MIT, has run the same benchmark test three times over the past year, with the same result every time. Large language models including Claude and ChatGPT can handle some chemistry tasks reliably. Kinase inhibitors — the kind of small-molecule drugs pharmaceutical companies have pursued for decades — are within reach. The models pass. But when the benchmark turns to metal-organic frameworks, or MOFs — highly porous crystalline materials used for gas storage, catalysis, and drug delivery — the models fail consistently.
The 22-atom ligand benchmark is not a trick question. It is a real test of whether a model can reason about how atoms arrange themselves in three-dimensional space, and how that arrangement affects energetic stability. The models fail it on MOFs. They pass it on kinases.
The reason comes down to training data. "We have really good datasets for really boring chemistry," Kulik said on the Latent Space podcast. Kinase inhibitors have been studied for decades. The protein structures are in the Protein Data Bank (PDB). The training data is abundant and clean. MOFs have no PDB equivalent. The best ground truth available is density functional theory calculations — which is to say, computationally expensive simulations of quantum behavior, not measurements of actual molecules. The data is noisy, limited, and expensive to produce.
AlphaFold solved protein folding because the problem had specific features that made it solvable: 20 amino acids, well-characterized building blocks, enormous amounts of training data from decades of structural biology. Materials science does not have that. Each element in the periodic table introduces new chemistry that does not transfer to the next. There is no universal building block, no shared grammar. AlphaFold worked because biology converged on a small alphabet and then repeated it. Chemistry has not converged.
There are successes. Kulik's group has designed materials using AI guidance that turned out to work — including one polymer that was four times tougher than expected, because the model had found a quantum mechanical effect that human chemists had not anticipated. That is real. It is also an isolated win, not a general capability.
The honest version of the story is not that AI cannot do materials science. It is that AI can do some materials science, in specific conditions, with abundant training data. The gaps are not incidental. They are structural. The benchmark that models pass for kinases fails for MOFs because MOFs lack sufficient training data, and the problem is not fixable without someone doing the expensive quantum chemistry calculations to generate that data. AlphaFold for materials requires a PDB for materials. Nobody has built it yet, and the incentives to do so are not obvious.
For people trying to use AI to discover new materials: know which category your problem falls into. Small molecules with decades of literature behind them are tractable. Novel structural materials with no large dataset are still the domain of human intuition, expensive simulation, and experimental trial. The gap between those two categories is the whole story of where AI actually stands in chemistry right now.