AI Labs Are Getting Their Models Stolen. Their Defenses Might Not Work.

Every major AI lab says it has defenses against model distillation: the technique where someone clones a rival's AI by peppering it with queries and training a copycat on the outputs. Google, OpenAI, and Anthropic have all disclosed their own antidistillation methods. They publish benchmark numbers. The numbers look good.

A paper posted to arXiv on May 21 by researchers at Stanford and EPFL suggests those numbers may not mean what the labs think they mean. The core finding: existing defenses are tested against passive students, the kind of attacker who takes the queries they're given and trains straightforwardly on the responses. Real-world distillation campaigns are adaptive. They change their strategy based on what works. When evaluated against that harder standard, the defenses look substantially weaker.

The paper frames this as a game of imperfect information: the defender doesn't know which queries are part of an attack, and the attacker doesn't know which defenses are deployed. The researchers model both sides using a minimax framework and derive a practical defense called Product-of-Experts, or PoE. Rather than requiring costly per-query computation or degrading the quality of outputs, PoE combines the original model's responses with a deliberately weaker proxy student during generation. The result, the researchers argue, is a forward-pass-only method that is substantially cheaper than alternatives while preserving the quality of reasoning traces that defenders care about.
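As a rough illustration of what a forward-pass-only Product-of-Experts combination could look like, the sketch below multiplies two next-token distributions in log space and renormalizes. It is a minimal sketch under stated assumptions, not the paper's implementation: the function names, the `weight` parameter, and the choice to combine at the next-token level are all hypothetical, and the article does not specify the sign or size of the weighting the researchers use.

```python
import math


def log_softmax(logits):
    """Convert raw logits to log-probabilities in a numerically stable way."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def poe_next_token_probs(teacher_logits, proxy_logits, weight):
    """Hypothetical PoE step: p(x) proportional to p_teacher(x) * p_proxy(x) ** weight.

    Assumption: `weight` is a tuning knob introduced for this sketch.
    weight == 0 recovers the teacher's own distribution; a negative weight
    shifts probability away from tokens the proxy student finds likely.
    """
    if len(teacher_logits) != len(proxy_logits):
        raise ValueError("teacher and proxy must share one vocabulary")
    teacher_lp = log_softmax(teacher_logits)
    proxy_lp = log_softmax(proxy_logits)
    # A product of experts is a weighted sum in log space, then renormalized.
    combined = [t + weight * p for t, p in zip(teacher_lp, proxy_lp)]
    return [math.exp(x) for x in log_softmax(combined)]
```

Only one extra forward pass through the small proxy model is needed per generated token in this sketch, which is the sense in which such a scheme could be cheaper than defenses that run a separate optimization for every query.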

The passive-adaptive gap is the key finding. Under passive evaluation, PoE appears inferior to more expensive defenses. Under adaptive evaluation, the difference between PoE and those costlier defenses shrinks considerably. "Strong distillation remains difficult to stop," the authors write, "and progress on antidistillation should be judged against adaptive students rather than passive ones."

The timing matters because the geopolitical stakes have escalated sharply. In February 2026, OpenAI sent a memo to the U.S. House Select Committee on the Chinese Communist Party accusing DeepSeek of using distillation to copy American AI models. Google separately disclosed that Gemini had been targeted by a single campaign involving more than 100,000 prompts. Anthropic identified industrial-scale extraction campaigns attributed to three named labs, including MiniMax, which Anthropic said was responsible for more than 13 million exchanges with the Claude API. U.S. private AI investment totaled $285.9 billion in 2025, compared with $12.4 billion in China, according to an analysis by the International Institute for Strategic Studies (IISS). Despite that investment gap, the resource advantage has not translated into dependable model protection.

The U.S. government has responded. In April 2026, the White House issued NSTM-4, a directive that designates Chinese AI distillation attacks a national security concern and directs federal agencies to treat them accordingly. A proposed law, the Deterring American AI Model Theft Act, would formalize the response by creating a public AI Model Extraction Attackers List and authorizing sanctions against identified actors. The IISS analysis notes that while the United States controls roughly 74 percent of global AI compute and China roughly 14 percent, China's stated goal of achieving AI parity by 2030 makes distillation an attractive shortcut.

John Hultquist, chief analyst at Google Threat Intelligence, said in February that the scale of the Gemini attack suggested a broader pattern. "We are going to be the canary in the coal mine for far more incidents," he told NBC News. Google classifies distillation as intellectual property theft.

The accountability question the paper raises is pointed. If antidistillation defenses are evaluated mainly against passive students while real attackers are adaptive, then labs that deployed those defenses may be more exposed than their published benchmarks indicate. The paper does not name which specific defenses it tests, and its full benchmark tables are not yet available in a public digest. The work has not been peer-reviewed. Researchers who build defenses have an obvious incentive to test them against the easier benchmark, and their marketing reflects that choice.

The PoE approach, if it holds up under independent evaluation, would be a practical option for smaller AI developers who lack the compute budget of the frontier labs. The paper argues that the cost-performance tradeoff favors PoE over more expensive alternatives, particularly for teams that cannot afford to run full per-query adversarial filtering on every API call.

The real test will come when other researchers attempt to replicate the results. Labs whose defenses were tested in the paper will be asked to respond. The passive-adaptive gap, if it generalizes beyond the benchmarks in the paper, would be a structural problem for how the field evaluates model security. That is a high bar. It is also the question that matters most.
