Beyond web scraping: Mozilla bets on community-owned AI data

The training data that built modern AI was not negotiated. It was scraped. Names, faces, voices, and sentences were taken from public corners of the web, fed into models, and monetized at a scale the original contributors never agreed to. The supply chain that produced the current generation of models is now under pressure from regulators and from the people whose information was taken, and the next chapter of how AI gets its data is being written in real time.

In November 2025, Mozilla Data Collective, a venture built by longtime Mozilla figure E.M. Lewis-Jong, set out to assemble what it calls a consent-first marketplace for AI training data. The pitch, as reported in a SiliconANGLE feature by Paul Gillin, is that the people and communities whose information trains a model should sit inside the supply chain rather than upstream of it, set the terms, and receive a share of the value that gets created. Lewis-Jong calls the data problem "big, structural" and the solution "structural." Whether that framing holds up depends on mechanics the launch has only sketched.

The most concrete piece of evidence is something Mozilla already runs. Common Voice, the foundation's long-running volunteer speech project, has collected contributions from more than 500,000 people across hundreds of languages and produced one of the largest public voice datasets in the world. It is the case study for the Collective's bet: contributors will give data when they have a voice in how it is used, and the resulting corpus can be large enough to matter. Lewis-Jong leans on it directly. The harder question is whether consent and compensation survive contact with a model-training market that has never paid for raw input at scale.

The Collective's model goes beyond Common Voice. Today the platform hosts hundreds of curated datasets representing more than 300 languages, including Hazargi literature from Afghanistan, oral histories in the Mada language from Cameroon, and Romansh newspapers from Switzerland — resources Lewis-Jong says would be difficult or impossible to find through conventional commercial data channels. Contributors can choose to share data openly, require attribution, limit use to educational or research purposes, restrict access geographically, or seek compensation. Those decisions belong to data creators rather than to an intermediary platform.
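The contributor choices listed above can be pictured as a small record attached to each dataset. What follows is a minimal sketch, not the Collective's actual schema or API: the names `DatasetTerms`, `AccessRequest`, and `refusal_reasons`, and every field in them, are hypothetical and chosen only to mirror the options the article describes.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional


@dataclass(frozen=True)
class DatasetTerms:
    """Hypothetical contributor-set terms; field names are illustrative only."""
    requires_attribution: bool = False
    allowed_purposes: Optional[FrozenSet[str]] = None  # None means any purpose
    allowed_regions: Optional[FrozenSet[str]] = None   # None means any region
    price: float = 0.0                                 # 0.0 means shared without charge


@dataclass(frozen=True)
class AccessRequest:
    """Hypothetical description of what a downloader intends to do."""
    purpose: str
    region: str
    will_attribute: bool
    offered_payment: float


def refusal_reasons(terms: DatasetTerms, request: AccessRequest) -> List[str]:
    """Return every contributor term the request fails; an empty list means allowed."""
    reasons = []
    if terms.requires_attribution and not request.will_attribute:
        reasons.append("attribution required")
    if terms.allowed_purposes is not None and request.purpose not in terms.allowed_purposes:
        reasons.append("purpose not permitted")
    if terms.allowed_regions is not None and request.region not in terms.allowed_regions:
        reasons.append("region not permitted")
    if request.offered_payment < terms.price:
        reasons.append("payment below contributor's price")
    return reasons
```

In this sketch the platform only evaluates the terms; it never fills in or overrides them, which is the point of leaving those decisions with data creators.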

The organization operates as a "mission-locked British social enterprise," a structure Lewis-Jong describes as baking purpose into governance at multiple levels. "We exist to give communities ownership and agency over their data, and enable them to define and drive fair value exchange on their own terms," Lewis-Jong said. The structure is designed to avoid what the Collective sees as the limitations of both traditional nonprofits (which can struggle to build sustainable infrastructure at scale) and venture-backed startups (which face pressure to prioritize growth over community interests). The Collective measures success against a double bottom line: financial performance and mission stage gates. "If we don't hit our mission stage gates, we don't get to exist," Lewis-Jong said.

The Collective launched with a $10 million initial commitment from the Mozilla Foundation. It does not take a percentage of the fees communities choose to charge for their datasets; contributors receive the full amount, while downloaders pay a separate platform fee to cover infrastructure and operating costs. Lewis-Jong describes the platform less as a broker and more as a bridge — not necessarily competing directly with large data brokers that dominate AI training pipelines, but creating an alternative model that connects developers with communities historically overlooked by mainstream data markets.
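The fee arrangement is easier to see as arithmetic. The sketch below assumes the platform fee is simply added on top of the contributor's price; the article does not say how the fee is calculated, and the function name `checkout` and any amounts passed to it are hypothetical.

```python
from decimal import Decimal
from typing import Dict


def checkout(contributor_price: Decimal, platform_fee: Decimal) -> Dict[str, Decimal]:
    """Split one purchase so the fee sits on top of the contributor's price.

    Assumption: the platform fee is a separate charge to the downloader,
    never a deduction from what the contributor asked for.
    """
    return {
        "downloader_pays": contributor_price + platform_fee,
        "contributor_receives": contributor_price,
        "platform_receives": platform_fee,
    }
```

Under a conventional commission model, the same fee would instead be subtracted from the contributor's side of the transaction.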

That market is also under new pressure. Regulators in Europe, the United Kingdom, and parts of the United States have moved from abstract concern to enforcement around indiscriminate web scraping, biometric data, and copyrighted material. For a venture selling consent-based supply, the regulatory squeeze is not background. It is part of the sales pitch. A buyer worried about training-data provenance now has a reason to look at a curated pool with documented contributor agreements, and a venture that can deliver one has a wedge against the open web.

The Collective's design treats "fair value exchange" (compensation or other value returned to the people whose data trains the model) as the load-bearing term. Read narrowly, that means revenue share or per-dataset payment. Read broadly, it gestures at a structural shift in who owns the inputs to a model. Both readings are in tension with the way the AI industry currently buys data, which is to pay as little as possible for the largest possible pile and treat the question of consent as a legal footnote. The Collective's test is whether enough buyers will pay a premium for cleaner provenance to keep the marketplace solvent.

There are reasons to be skeptical. Data trusts (governance models in which contributors, not the platform, set the terms for how their data is used) have produced a long tail of pilots and a short list of survivors over the last decade. The data economics of large models reward scale, and a curated, consent-based supply chain will always cost more per example than an indiscriminate crawl. Independent partners, named buyers, and pilot results are the next data points that would either tighten or loosen the case. So far the Collective is leaning on its own description and on Mozilla's brand. That is a real asset, but it is not enough.

What to watch: whether the Collective announces named data providers, research partners, or commercial buyers beyond the Mozilla ecosystem; whether the compensation mechanic survives disclosure without collapsing into a small per-row payment; and whether the regulatory pressure that makes the pitch timely produces a market for it. The model that built the current generation of AI is being audited in public for the first time. Mozilla's bet is that the audit eventually becomes the procurement process, and that communities that helped build the dataset should be in the room when the bill comes due.

Sources