For years, the question of whether a given song was used to train a commercial AI model has lived inside a black box. Artists and rights-holders could suspect infringement, but the evidence sat inside labs' private training corpora, accessible only through discovery in litigation. A new searchable database from The Atlantic, built by reporter Alex Reisner, flips that dynamic by indexing four music datasets used to train AI models and making them queryable by anyone with a browser.
According to The Verge's account of the project, two of the datasets are very large, with roughly 12 million and 9 million tracks, and two are smaller, each holding more than 100,000 songs. The datasets are freely downloadable and have been pulled thousands of times. Google and Stability have both confirmed that they used them.
What changes is the asymmetry of access. A rights-holder who suspected a model had been trained on a specific song once depended on a court order, an academic paper that named a few examples, or a leak. Now an artist, a label, or a lawyer can search the index, confirm that a work appears in a training set known to have been used to build commercial AI, and act on that fact without the lab's cooperation. The training corpus becomes a citable artifact on the open web.
The disclosure lands inside an active copyright fight over generative audio. The same datasets that any music fan can now search sit at the heart of an ongoing debate about whether AI labs had the right to train on them in the first place. By publishing a queryable index, The Atlantic has effectively converted those datasets, which were downloadable but impractical for most people to inspect, into a public resource for plaintiffs and for academic researchers auditing model provenance.
The database also has a second-order effect. If a rights-holder can now confirm a work's presence in a training set, the same query is open to anyone else with a stake in that set: catalog owners checking what went into a rival's model, downstream builders comparing base corpora before they build on top of them, and the next round of entrants deciding which corpora are worth the legal exposure of downloading.
The Verge reports that The Atlantic built the tool as a public resource for artists, labels, and rights-holders to verify inclusion in training data. Whether the index changes the outcome of the copyright debate is a separate question. It does, however, make the question of what went into these models much easier to ask.