Trillion Gene Atlas Expands Evolutionary Datasets for Next-Generation AI Therapeutics
Basecamp Research, with Anthropic, Ultima Genomics, and PacBio, is building a 100-million-species genomic dataset to train AI models that learn from evolution rather than from existing clinical data.

Basecamp Research Wants to Build a Trillion-Gene Foundation for AI Drug Design
The pharmaceutical industry has spent decades getting very good at designing drugs one at a time, for specific targets, with bespoke processes that don't easily transfer. Basecamp Research thinks that's about to stop being necessary.
The company, a frontier AI lab for biological design, announced Wednesday the launch of the Trillion Gene Atlas — an initiative to expand known evolutionary genetic diversity by 100-fold, collecting genomic data from over 100 million species across thousands of sites globally. The project, built in collaboration with Anthropic, Ultima Genomics, and PacBio and powered by NVIDIA infrastructure, was unveiled at SXSW and the NVIDIA GTC conference in San Jose.
The goal is to provide the training data that would let AI systems learn from evolution's existing solutions to design new medicines on demand — not just predict what a protein looks like, but generate working therapeutics from a disease prompt. The initiative follows the company's EDEN (Environmentally-Derived Evolutionary Network) foundation models, published as a preprint on bioRxiv in January.
The EDEN paper is worth dwelling on because it provides the scientific backbone for why Basecamp believes this approach will work. The largest model in the family has 28 billion parameters and was trained on the full corpus of 9.7 trillion nucleotide tokens. At the time of training, the underlying BaseData database contained more than 10 billion novel genes from over 1 million species, according to the paper, more than 10 times the size of all public genomic resources combined.
The preprint describes three validation experiments that form the empirical case. In the first, EDEN was used to design large serine recombinases — enzymes that can insert DNA at specific genomic sites — prompted with as little as 30 nucleotides of DNA sequence. Across ten disease-associated genomic loci, the model achieved a 63.2% functional hit rate. Half of the generated recombinases were active in human cells, reaching therapeutically relevant levels of CAR gene insertion in primary human T cells. In the second experiment, the same model generated a focused library of antimicrobial peptides, 97% of which showed activity against critical-priority multidrug-resistant pathogens. The third involved designing a synthetic microbiome — 94,000 assemblies covering 9,067 species at 99% taxonomic accuracy — to test whether the model captures inter-genomic features.
"Today's biological AI models are trained on a narrow slice of life on Earth," said Glen Gowers, co-founder and CEO of Basecamp Research, speaking at SXSW. "Training models at this scale establishes a new paradigm for programmable therapeutic design."
The framing — that biology has been "data-starved" relative to language or vision — is one the field has been building toward for years. The Trillion Gene Atlas extends that thesis to what the company calls the "internet of biology," collecting from environments traditional labs can't easily reach, with new partnerships announced in Chile and Argentina and an expanded collaboration in Antarctica.
The partnership with Anthropic is notable. Anthropic is best known for AI safety work and large language models, not genomics. The company didn't respond to questions about its specific role, but the involvement suggests Basecamp sees the AI capability — reasoning across complex biological data — as a core part of the value proposition, not just the sequencing hardware.
Processing the data at scale is a separate engineering challenge. The company estimates that processing the quadrillions of DNA base pairs involved would have taken over 20 years using previous methods. With NVIDIA Parabricks acceleration and parallelized processing, Basecamp expects to compress that to under two years.
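The size of that compression can be sanity-checked with rough arithmetic. The sketch below is illustrative only: the base-pair total is an assumed value consistent with "quadrillions," not a figure reported by Basecamp or NVIDIA.

```python
# Back-of-envelope check: what throughput gain turns a 20-year
# sequential processing job into an under-2-year one.
# All inputs are illustrative assumptions, not reported numbers.

base_pairs = 2e15            # assumed total ("quadrillions" of base pairs)
old_duration_years = 20      # prior-method estimate cited in the article
target_years = 2             # compressed timeline cited in the article

old_rate = base_pairs / old_duration_years   # base pairs/year, prior methods
required_rate = base_pairs / target_years    # rate needed to hit the target

speedup = required_rate / old_rate
print(f"Required throughput gain: {speedup:.0f}x")  # → 10x
```

Note that the assumed base-pair total cancels out of the ratio: whatever the true data volume, going from 20 years to 2 requires roughly an order-of-magnitude effective throughput gain from GPU acceleration and parallelism combined.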
There are honest questions worth asking. The EDEN paper is a preprint, not yet peer-reviewed, and several authors disclose financial interests in Basecamp. Wet-lab validation in the preprint involved relatively small sample sizes in some categories, and the step from designing recombinases and peptides to a full drug pipeline will require significantly more experimental evidence. The competitive landscape in AI biology is also crowded: Recursion, Generate:Biomedicines, EvolutionaryScale, and others are building foundation models for biology with different approaches and different datasets. Whether evolutionary data at scale from non-model organisms translates into a durable advantage is an open scientific question.
What's clearer is that Basecamp has built something real. A 97% hit rate on antimicrobial peptides in a low-N library is a meaningful wet-lab result, not marketing copy. The aiPGI result, designing functional recombinases from a 30-nucleotide sequence prompt, is a genuinely novel capability if it replicates. And because the work is posted as a bioRxiv preprint, the data is out for the scientific community to evaluate rather than resting on the company's own characterization.
The Trillion Gene Atlas is the next step: expanding the data layer that makes the models work. Whether it delivers on the scale it promises will take years to answer. But the bet the company is making — that evolutionary breadth is the limiting reagent in AI drug design — is one that a growing number of researchers and investors appear willing to back.

