Training a frontier AI model normally means keeping thousands of chips in near-perfect lockstep. One hardware failure and everything stops. DeepMind just removed that single point of failure, and it did it with bandwidth you can buy from any internet service provider.
On Thursday the lab published Decoupled DiLoCo, a distributed training system that trained a 12-billion-parameter model across four separate U.S. regions over 2 to 5 gigabits per second of wide-area networking: ordinary internet infrastructure rather than the custom high-throughput links AI operations typically require. According to DeepMind's blog post, the system was more than 20 times faster than conventional synchronization methods while delivering equivalent machine learning performance. The team proved it worked by deliberately breaking its own hardware during training runs: using a technique called chaos engineering, it injected artificial failures and watched the system seamlessly reintegrate failed units once they recovered. Netflix uses the same approach to keep its servers running when individual machines crash.
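The fault-tolerance idea can be sketched in a few lines. The code below is a hypothetical illustration, not DeepMind's implementation: a synchronization step that averages whatever updates actually arrive, plus a chaos-engineering hook that randomly drops an island's report to simulate a hardware failure. All function names are invented for this sketch.

```python
import random

def robust_average(updates):
    """Average the parameter deltas that arrived this round; islands
    that failed report None and are simply skipped."""
    alive = [u for u in updates if u is not None]
    if not alive:
        return None  # every island failed; skip this synchronization
    return [sum(vals) / len(alive) for vals in zip(*alive)]

def chaos(update, failure_rate, rng):
    """Chaos-engineering hook: randomly drop an island's update to
    simulate a hardware failure during training."""
    return None if rng.random() < failure_rate else update

# Three islands report updates; chaos may knock any of them out.
rng = random.Random(0)
reports = [chaos(u, 0.3, rng) for u in ([1.0, 2.0], [3.0, 4.0], [5.0, 6.0])]
synced = robust_average(reports)  # averaged over whichever islands survived
```

Because a failed island's contribution is merely skipped rather than waited for, a recovered unit can rejoin by pulling the current shared parameters; it holds no state the rest of the group depends on.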
Distributed training normally requires every copy of the model to apply the same update at the same step, which means every piece of hardware must communicate constantly at tens of gigabits per second. DiLoCo, a method first published in November 2023, changed that by having compute nodes exchange updates only at fixed intervals rather than continuously, eliminating the blocking bottleneck where one lagging component holds everything back. Decoupled DiLoCo builds on two earlier Google systems, Pathways and the original DiLoCo, to enable what the paper calls islands of compute that train independently and synchronize periodically. A hardware failure in one location does not cascade. Other islands keep learning. When a failed unit recovers, it rejoins without losing the group's accumulated progress.
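The loop behind DiLoCo-style training can be illustrated with a toy example. This is a minimal sketch under simplifying assumptions, not the published recipe: each island runs many local optimization steps, then the islands average their parameter deltas and apply the result to the shared model. The original DiLoCo paper reportedly uses AdamW for the inner steps and Nesterov momentum for the outer step; plain SGD stands in for both here, and every name is invented for illustration.

```python
def inner_steps(params, steps, lr, grad_fn):
    """One island trains locally for `steps` updates (plain SGD here)."""
    p = list(params)
    for _ in range(steps):
        g = grad_fn(p)
        p = [w - lr * gi for w, gi in zip(p, g)]
    return p

def diloco_round(global_params, islands, inner, lr, outer_lr):
    """One outer round: islands train independently, then synchronize once."""
    deltas = []
    for grad_fn in islands:
        local = inner_steps(global_params, inner, lr, grad_fn)
        # The "outer gradient" is how far this island moved from the shared model.
        deltas.append([g - l for g, l in zip(global_params, local)])
    avg = [sum(d) / len(deltas) for d in zip(*deltas)]
    return [g - outer_lr * d for g, d in zip(global_params, avg)]

# Two islands with different local objectives (quadratics centered at 1 and 3):
islands = [lambda p: [w - 1.0 for w in p],
           lambda p: [w - 3.0 for w in p]]
params = [0.0]
for _ in range(20):
    params = diloco_round(params, islands, inner=10, lr=0.1, outer_lr=1.0)
# params[0] converges toward 2.0, the consensus of the two islands
```

Communication happens once per `inner` local steps instead of at every step, which is where the drastically lower bandwidth requirement comes from.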
The practical effect is a system that runs on existing infrastructure instead of requiring purpose-built data center links. DeepMind argues this turns stranded compute (idle capacity in underutilized data centers) into usable training capacity at frontier scale. It also supports mixing different hardware generations in a single training run, potentially extending the useful life of existing chips as new accelerators arrive at different times in different places. An independent team at Prime Intellect replicated the original DiLoCo method in open source last year, according to its blog, suggesting broad industry interest in lower-bandwidth distributed training approaches. IEEE Spectrum has covered the broader push toward decentralized AI training infrastructure.
The results come from Google's own benchmarking and have not yet been replicated by independent researchers. The Gemma 4 12B model is smaller than the largest models in active development, where synchronization costs scale differently and failure modes multiply. And decoupling introduces latency by design: tasks requiring tight real-time coordination between model components may not benefit from this approach.
What to watch next is whether the advantage holds as training runs scale toward larger models. Google has published the paper and calls the infrastructure production-ready, and Microsoft and Amazon have published similar work on distributed training, suggesting the industry is converging on the same problem. Whether this solution keeps pace as that problem grows remains an open question.