AI startup Basecamp Research announces trillion-gene project
Basecamp Research's Trillion-Gene Bet: The Data Moat That Actually Matters in AI Drug Design
There's a quiet competition happening beneath the headlines about AI models and drug discovery, and it has almost nothing to do with architecture. It has everything to do with data.
Basecamp Research, a London-based AI biotech that has quietly assembled one of the most formidable biological datasets in the world, used SXSW and NVIDIA GTC this week to announce the Trillion Gene Atlas: an initiative to collect genomic data from more than 100 million species across thousands of sites worldwide, expanding known genetic diversity roughly 100-fold. The company's existing EDEN foundation models, trained on more than 10 billion new-to-science genes from a million newly discovered species, have already demonstrated programmable gene insertion in human T cells and a 97% hit rate designing antimicrobial peptides against drug-resistant pathogens.
The partners in the Atlas are Anthropic, Ultima Genomics, PacBio, and NVIDIA. NVentures, NVIDIA's venture arm, has also invested in Basecamp's pre-Series C round.
The story that often gets lost in AI biotech coverage is that nearly every current sequence-based foundation model is trained on variants of the same public repositories. Estimates vary, but figures circulating in the field suggest that roughly 80% of the training data behind the major public models comes from a database of fewer than 250 million sequences. Basecamp's proprietary BaseData dataset, collected over six years from 152 partner organizations in 31 countries, is currently more than 10 times larger than all public resources combined, according to the company. That includes new partnerships in Chile, Argentina, and Antarctica announced alongside the Atlas.
The critical question is whether data scale in genomics follows the same scaling laws that made large language models so powerful. Basecamp's EDEN models suggest it does. Their paper, co-authored with NVIDIA, Microsoft, and leading academics, describes new scaling laws where AI capabilities jump as datasets grow richer and more diverse — not just larger. CTO Phil Lorenz put it plainly: "Bigger models alone aren't enough. EDEN showed that performance in biological AI follows much steeper scaling trajectories with higher quality and fully contextualized data. The Trillion Gene Atlas extends that principle 100-fold."
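The intuition behind a "steeper scaling trajectory" can be made concrete with a toy power-law model. The sketch below is purely illustrative: the exponents, constants, and token counts are invented for demonstration and are not EDEN's published fit.

```python
# Illustrative sketch only: hypothetical power-law scaling curves.
# All constants here are invented, not taken from the EDEN paper.

def scaled_loss(n_examples: float, alpha: float, a: float = 10.0) -> float:
    """Classic power-law form: loss = a * n^(-alpha).

    alpha is the scaling exponent; a steeper (larger) alpha means each
    additional unit of data buys a bigger reduction in loss.
    """
    return a * n_examples ** -alpha

# Compare the payoff of a 100x data increase under two exponents:
# a shallow one (redundant public data) vs. a steeper one (richer,
# more diverse, fully contextualized data, per the article's claim).
shallow_gain = scaled_loss(1e9, alpha=0.05) / scaled_loss(1e11, alpha=0.05)
steep_gain = scaled_loss(1e9, alpha=0.15) / scaled_loss(1e11, alpha=0.15)

print(f"100x data, shallow exponent: loss falls {shallow_gain:.2f}x")
print(f"100x data, steep exponent:   loss falls {steep_gain:.2f}x")
```

Under these made-up numbers, the same 100-fold data expansion cuts loss about 1.26x on the shallow curve but about 2.0x on the steeper one, which is the sense in which data quality and diversity, not raw count alone, determine what a 100-fold atlas expansion is worth.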
On the therapeutic side, Basecamp's AI-Programmable Gene Insertion platform (aiPGI) has demonstrated insertion at more than 10,000 disease-related sites in the human genome, including integration of cancer-fighting DNA into primary human T cells at novel safe-harbor sites, producing CAR-T cells that showed over 90% tumor-cell clearance in lab assays. The company is not yet in the clinic, but the data is moving fast.
One thing worth noting: Basecamp Research has no connection to the project-management software company of the same name. The name collision is unfortunate, but the science stands apart from it.
The competitive question is whether their data moat — built on global biodiversity partnerships, equitable access-and-benefit-sharing agreements, and off-grid sequencing technology — is defensible at scale. If the scaling laws hold, the answer is probably yes. The Trillion Gene Atlas, if it delivers, puts them years ahead of any competitor trying to assemble the same dataset from scratch.
Notebook: The ABS (Access and Benefit Sharing) framework Basecamp uses for its international biodiversity partnerships is quietly a significant competitive and ethical advantage. As DSI (Digital Sequence Information) regulations tighten globally, companies with existing treaty-level data-sharing agreements will be harder to replicate than their model weights. This is worth watching as a potential inflection point in how the genomics industry thinks about data ownership and equity.