The Data Shortcut: Why Some Chinese AI Labs Copy American Models


When Anthropic published a security report in February naming three Chinese AI labs for using its models to train their own, the story quickly hardened into a national one: Chinese firms stealing American intelligence. The more useful version is structural. The three labs Anthropic named, DeepSeek, Moonshot, and MiniMax, were caught running roughly 16 million distillation exchanges against Claude across about 24,000 fraudulent accounts, according to the company's distillation-attack report. Four other big Chinese AI labs, Alibaba's Qwen, ByteDance's Seed, Tencent's Hunyuan, and Xiaomi's Mimo, did not appear in the report, not because they were cleaner, but because they already had something DeepSeek, Moonshot, and MiniMax did not.

Adversarial distillation, in plain terms, is training your own model on a competitor's outputs. Instead of building capability from scratch on your own data, you let a frontier model generate answers and teach a smaller model to imitate them. The technique itself is widespread and, in some forms, legal. Anthropic's grievance with these three labs is that the exchanges were coordinated, fraudulent, and aimed at producing a direct competitor. That is a real complaint, and it is one the company has chosen to publicize.
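
The mechanics are easier to see in a toy example. The sketch below is an illustration only, not a description of what any lab named in the report actually did: the `teacher` function is a hypothetical stand-in for a frontier model reachable only through its outputs, and the "student" is a two-parameter linear model rather than a language model. The point it shows is that the student never sees human-labeled data; it learns entirely from the answers the teacher returns.

```python
import random


def teacher(x: float) -> float:
    """Hypothetical stand-in for a frontier model queried as a black box."""
    return 3.0 * x + 2.0


def distill(queries, epochs=2000, lr=0.05):
    """Fit a student y = w*x + b to the teacher's answers, not to human labels."""
    # Step 1: send each query to the teacher and record its output.
    pairs = [(x, teacher(x)) for x in queries]
    n = len(pairs)
    w, b = 0.0, 0.0
    # Step 2: gradient descent on the mean squared gap between student and teacher.
    for _ in range(epochs):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in pairs) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in pairs) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


rng = random.Random(0)  # fixed seed so the run is repeatable
queries = [rng.uniform(-1.0, 1.0) for _ in range(200)]
w, b = distill(queries)
print(f"student learned w={w:.3f}, b={b:.3f}")
```

The student ends up close to the teacher's hidden parameters after only 200 queries, which is why the approach is attractive to a team that has neither labels nor a proprietary data source. Real distillation works on text and far larger models, but the shape of the loop, query then imitate, is the same.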

The puzzle is why these three and not the others. Interconnect analyst Kevin Xu, who visited all three accused labs in the spring, argues in a recent essay on Chinese AI distillation that the answer is data access. Qwen lives inside Alibaba, Seed inside ByteDance, Hunyuan inside Tencent, and Mimo inside Xiaomi. Each of those parents runs a platform with hundreds of millions of daily users, and each sits on a proprietary data pipeline that the model team can draw from: shopping and payments for Alibaba, video and creator behavior for ByteDance, messaging and gaming telemetry for Tencent, and mobile device interaction for Xiaomi. The data is not labeled, but it is real, abundant, and unique to the parent. In the West, only Apple and Google sit on anything comparable. Independent Chinese firms have no equivalent.

DeepSeek, Moonshot, and MiniMax are exactly such firms. They are venture-backed or research-oriented outfits without a parent platform's proprietary data assets. To train a frontier model they need labeled data at scale, and the mature Western vendors that provide it, such as Scale AI, Surge, and Mercor, have no Chinese equivalents operating at the same level. Xu reports that one researcher he met said his team relied heavily on CommonCrawl and that domestic Chinese data vendors delivered poor-quality data. The structural gap is the motive. When you cannot buy the labels and you cannot generate the data, distilling a competitor's outputs starts to look like the shortest path to a competitive model.

That framing changes the policy conversation. The US has been tightening export controls on frontier model weights, and the White House's National Security Memorandum on artificial intelligence, NSTM-4, published in April 2026, signals continued pressure. But if the bottleneck for independent Chinese labs is training data, not model weights, then gating access to American frontier models addresses only part of the problem. The same dynamic also explains why export-control actions reported by the Wall Street Journal appear to have hurt independent labs more than the big-tech in-house teams, which can keep training on their own proprietary data while the frontier is cut off.

There is a second-order prediction worth carrying out of the story. When frontier access closes, the labs most likely to find workarounds are the ones with the deepest proprietary data lakes. The labs most likely to be choked are the ones that were already leaning on distillation. The split is not national in the way the public argument implies. It runs through the structure of the Chinese AI industry itself, between the four in-house teams of Alibaba, ByteDance, Tencent, and Xiaomi, and the independent firms competing with them for frontier capability.

The hard limit of this analysis is also worth naming. Adversarial distillation at the scale Anthropic reported, 16 million exchanges, is not a minor competitive tactic, and Anthropic's complaint that competitors are training on its outputs is legitimate. The fact that the motive is structural does not make the act any less a violation of Anthropic's terms of service. The two readings can both be true at once: a real grievance on the company's part, and a real structural reason on the labs' part. The public story that picks one and discards the other is the story that will age badly.

What to watch next is whether the gap widens. If Chinese regulators succeed in building a domestic labeling industry at scale, the structural pressure on independent Chinese labs could ease, and so could the temptation to distill. If export-control actions extend to compute or chip access, the in-house labs' proprietary data becomes a partial buffer, and the independent labs lose a tool they have already used. The frontier race between American and Chinese AI is often described as a national contest, and on some dimensions it is. On the dimension that explains what Anthropic caught in February, it is a contest between two business models inside one Chinese industry, and the outcome will turn on which one has the better data.

Sources