The Experiment That Proved Data Hoarding Was Bad Science

The Experiment That Proved Data Hoarding Was Bad Science — type0 | type0

PREVIEWThe Experiment That Proved Data Hoarding Was Bad Science · MD

The pharmaceutical industry has spent decades treating proprietary data as a fortress. The logic was simple: more private data meant better AI models, which meant faster drug discovery, which meant competitive advantage. Then ten of the world's largest drug companies ran an experiment to test that assumption. They lost.

The experiment was called MELLODDY, and its conclusion was uncomfortable enough that it is now the subject of a conference revival. At the Bio-IT World conference in Boston last month, a keynote panel returned to the finding for the first time in public since the peer-reviewed results landed in 2023: when competitors share AI model training through federated learning — keeping their raw data locked inside their own servers — they produce models that outperform anything they build alone. Not marginally. Consistently. Across every participant.

"The winning strategy in 2026 is federated learning," said Woody Sherman, panel chair and computational chemist, framing it simply: bringing the algorithm to the data, not the data to the algorithm. The audience at Bio-IT World — 2,900 biopharma and IT leaders — heard it as a technical observation. It was also a strategic inversion with consequences that reach well beyond any single conference.

The prisoners, minus the dilemma

MELLODDY ran from 2019 to 2022. Ten pharmaceutical companies — Bayer, Novartis, GSK, Janssen, AstraZeneca, Merck, Amgen, Boehringer Ingelheim, and Servier — agreed to train machine learning models on their combined data without any of them actually sharing it. The EU contributed €18 million in funding under the Innovative Medicines Initiative. The technical architecture was federated learning: each company trains a local model, and only encrypted model updates — not raw data — flow to a central aggregator. The global model improves without anyone's proprietary compound library ever leaving the building.

As Mathieu Galtier, project coordinator at Owkin, described it: "no confidential data was ever exposed" — the federated architecture was designed so that only encrypted model parameters left each company's servers.

The scale was unprecedented: 2.6 billion experimental data points, covering more than 20 million chemical compounds across 40,000 biological assays. The results, published in the Journal of Chemical Information and Modeling, showed the federated approach producing consistent gains across all ten partners. Classification of pharmacologically or toxicologically active molecules improved by 4 percent on average. The models' applicability domain — their ability to make confident predictions on new types of molecules — expanded by 10 percent. Toxicity and pharmacology estimations improved by 2 percent. Not one partner got worse. All ten got better.

The performance gap was most pronounced for assays relating to pharmacokinetics and toxicology — precisely the areas where proprietary data has historically been most guarded and most scarce.

What the moat was actually worth

The pharmaceutical industry's attachment to data as a competitive weapon has deep roots. Proprietary assay data, failed compound libraries, clinical trial negatives — this information represents years of expensive experimentation that companies understandably treat as trade secrets. The logic was that a large, private dataset was a prerequisite for a powerful AI model, and a powerful AI model was a prerequisite for competitive drug discovery.

MELLODDY challenged both steps. The federated model did not just protect proprietary data — it produced better science than any single participant could achieve on its own. The data moat, in other words, was not just a privacy risk. It was a scientific disadvantage.

The implications are not lost on the industry. Apheris, a Berlin-based platform founded in 2019, is now running live federated networks for pharmaceutical companies. The company was a visible presence at Bio-IT World this year. Federated learning infrastructure that began as an academic research exercise has matured into a commercial product category.

The limits of coopetition

It would be tidy to declare the data-hoarding era over. It is not. MELLODDY operated under conditions that commercial rivals may find harder to replicate. The consortium had EU funding, academic partners, and a defined endpoint. The participants were not optimizing for competitive advantage during the project — they were optimizing for a publishable result. Real commercial deployment across actual competitors with genuine IP incentives introduces friction that a coordinated research consortium can absorb.

The 4 percent classification improvement, while consistent, is also modest in absolute terms. Whether it justifies the operational overhead of maintaining federated infrastructure — the secure Kubernetes clusters, the governance frameworks, the legal agreements — at commercial scale remains an open question. A company spending eight figures annually on internal AI pipelines is not going to restructure because of a 4 percent gain.

The privacy guarantees of federated learning are also not absolute. Model updates can, under certain conditions, be inverted to approximate underlying training data. MELLODDY's architects addressed this with differential privacy protections, but the attack surface exists.

The precompetitive layer

What does seem established is the structural principle: AI infrastructure for drug discovery may be shifting toward a precompetitive layer, where base models are shared, and competitive advantage migrates to who can contribute the most useful data and who can extract the most value from shared predictions.

That is a familiar pattern from other technology transitions. Linux did not make Red Hat irrelevant — it made Linux ubiquitous and the applications built on top of it the competitive surface. Cloud computing changed which advantages mattered without eliminating enterprise software advantages. If federated learning becomes standard for foundational drug discovery AI, the companies that move earliest to participate in the networks, contribute highest-quality data, and build the most refined downstream interpretation layers will hold different advantages than those that invested most heavily in data hoarding.

The Bio-IT World keynote panel included speakers from Columbia University, Eli Lilly, Apheris, and Bayer. The moderator asked whether federated learning would actually move the needle in drug discovery. The answer from MELLODDY — delivered three years before the question was posed at a 2026 conference — was: yes. Now the industry has to decide what to do with that.

The Experiment That Proved Data Hoarding Was Bad Science

Sources