For the first time, the United States has an integrated health database that is both the largest in the world and built from the start to represent the country's actual population. The National Institutes of Health said on June 30 that its All of Us Research Program, a federally funded effort to build a precision-medicine cohort of one million Americans, has crossed 747,000 participants, including more than 535,000 whole-genome sequences (complete readouts of a participant's DNA) linked to electronic health records. The data is now being released to qualified researchers at no cost.
That combination matters because the previous flagship of large-scale population genomics, Britain's UK Biobank, recruited around 500,000 participants between 2006 and 2010 from a pool that skewed heavily white and urban. That demographic limitation has shaped how researchers interpret UK Biobank findings for non-white populations. Scale unlocks the statistical power to study rare variants and small subpopulations; representativeness determines whether findings generalize to the patients who actually walk into American clinics. All of Us is now larger than the UK Biobank, and roughly half of its enrollees come from racial and ethnic minority groups, with all 50 states represented. That pairing is the practical basis for the equity-in-precision-medicine promise the program has carried since Congress authorized it in 2016.
The dataset, as NIH announced on Tuesday, includes about 482,000 linked electronic health records, 600,000 physical measurements, and 747,000 survey responses, on top of the genome sequences. NIH Director Dr. Jay Bhattacharya, in the agency's release, framed the milestone as enabling large-population pattern detection "across genetics, lifestyle, and environment," the central promise of precision medicine, or treatment tailored to an individual's biology and circumstances rather than averages drawn from non-representative samples.
Researchers can request access through the program's data portal, and the resource is already paying scientific dividends. The program counts more than 1,400 peer-reviewed publications built on All of Us data, including a 2024 Nature paper on the genomic data release itself and a 2026 Nature Medicine paper describing a wearables dataset that lets researchers study how continuous activity, heart rate, and sleep signals track with health outcomes. A recent All of Us-supported study used that activity data to identify patterns of physical activity associated with lower reported pain, a small example of the kind of phenotype-by-genotype work the cohort was built to enable.
"The All of Us dataset is a national treasure," Dr. Josh Denny, the program's CEO, said in the NIH release, "and we are committed to ensuring it remains a trusted resource for researchers and participants alike." Bhattacharya, in the same announcement, called the scale of the program a "paradox of plenty": enough data to find real signals, but enough diversity that those signals have to be carefully checked across populations that previous flagship studies missed.
The current announcement is a milestone in a multi-year roadmap rather than a one-off launch. According to GenomeWeb coverage of the program's plans, NIH intends to expand pediatric enrollment and release a new tranche of data later in 2026. Those are forward-looking commitments, not realized scale, and the pace of pediatric and rare-disease recruitment will be the metric to watch. A 2024 year-in-review piece in the American Journal of Human Genetics walks through governance and consent questions that NIH will need to keep answering as the dataset grows.
The announcement also lands as researchers push to integrate genomic data into the multimodal AI models being trained on electronic health records. A recent arXiv preprint, which has not yet been peer-reviewed, explores how whole-genome sequences can be folded into foundation models that already digest clinical notes, labs, and imaging. Datasets with the size and depth of All of Us are exactly the substrate those efforts are racing to consume.
Several caveats belong in the record. The dataset is built partly on self-reported surveys and on electronic health records that were originally collected for billing rather than research, which limits the resolution of lifestyle and exposure variables. The program has also faced consent-related criticism since its inception, most prominently over how broadly participants agree to future use of their data, including by commercial partners. And the superlative itself is NIH's framing. Britain's Genomics England and Finland's FinnGen project pursue different models that compete on linkage depth and national-coverage breadth rather than raw participant count. The honest reading is that All of Us now sets the benchmark on size and representativeness taken together, not on either dimension alone.