Google DeepMind Trains Frontier Models Across Distant Data Centers With Decoupled DiLoCo

Decoupled DiLoCo solves the "synchronization tax" that stalls tightly coupled training runs when a single chip fails. By isolating disruptions to the compute island where they occur, the system stays available through hardware outages, mirroring the pattern seen in distributed reinforcement learning infrastructure. It also allows different hardware generations, such as TPU v6e and v5p, to be mixed in one run without loss of ML performance.
While still an internal research effort, the approach makes globally distributed, fragmented compute a viable alternative to multi-billion-dollar mega-clusters. A technical report details how a 12B-parameter Gemma model was trained across four US regions over standard internet bandwidth, sidestepping single-site capacity constraints.
Frequently asked questions
- What is Decoupled DiLoCo?
- Decoupled DiLoCo is a distributed training architecture from Google DeepMind designed to train large language models across geographically distant data centers. It breaks training into separate islands of compute that communicate asynchronously. This allows training to continue even if specific hardware units fail, making the process more resilient than traditional synchronized methods.
- How does Decoupled DiLoCo handle hardware failures?
- The system uses a decoupled approach where training runs are divided into separate learner units. If a chip or cluster fails in one area, it does not interrupt the progress of other units. The architecture is self-healing, allowing failed units to be seamlessly reintegrated into the training process once they come back online without restarting the entire run.
- What are the bandwidth requirements for Decoupled DiLoCo?
- Decoupled DiLoCo is highly efficient, requiring significantly less bandwidth than conventional distributed training methods. In testing, it reduced required wide-area network bandwidth from approximately 200 Gbps to less than 1 Gbps. This allows frontier models to be trained using standard internet connectivity between data centers rather than requiring specialized, high-speed custom network infrastructure between facilities.
- Can Decoupled DiLoCo use different types of AI chips together?
- Yes, the architecture supports hardware heterogeneity, meaning it can mix different generations of hardware in a single training run. Google demonstrated this by combining TPU v6e and TPU v5p chips. Despite running at different speeds, the mixed fleet matched the machine-learning performance of runs on a single hardware generation, allowing older hardware to contribute to modern model training.
- What models have been trained using Decoupled DiLoCo?
- Google DeepMind used the architecture to train a 12-billion-parameter Gemma model across four separate regions in the United States. The experiment showed that the system could reach production-level results roughly 20 times faster than conventional synchronization methods, while matching the benchmarked accuracy and performance of traditional, tightly coupled training setups.
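The island-based training loop described above can be sketched in a few lines. This is an illustrative toy, not DeepMind's implementation: each island runs many local SGD steps on its own data shard, and only the resulting parameter deltas (often called pseudo-gradients in the DiLoCo literature) are averaged at infrequent outer steps. The objective, learning rates, and the `alive` flag used to simulate an island dropping out are all assumptions for the demo; the point is that a dead island is simply skipped and can rejoin at the next round.

```python
# Toy sketch of a DiLoCo-style outer loop (hypothetical, heavily simplified).
# Islands train independently; only parameter deltas cross the slow network,
# and a failed island stalls nothing.
import random

def local_sgd(w, data, steps, lr=0.1):
    """Run `steps` of SGD on a 1-D least-squares toy objective: min (w*x - y)^2."""
    for _ in range(steps):
        x, y = random.choice(data)
        w -= lr * 2 * (w * x - y) * x
    return w

def outer_round(global_w, islands, local_steps, outer_lr=0.7):
    """One communication round: average pseudo-gradients from surviving islands."""
    deltas = []
    for island in islands:
        if not island["alive"]:               # decoupled: skip failed islands
            continue
        w_local = local_sgd(global_w, island["data"], local_steps)
        deltas.append(global_w - w_local)     # pseudo-gradient sent over the WAN
    if not deltas:
        return global_w                       # no survivors this round
    return global_w - outer_lr * sum(deltas) / len(deltas)

random.seed(0)
# Four islands, each holding a shard of data generated from the target w* = 3.0.
islands = [{"data": [(x, 3.0 * x) for x in (1.0, 2.0, 0.5)], "alive": True}
           for _ in range(4)]
w = 0.0
for rnd in range(20):
    islands[1]["alive"] = (rnd % 5 != 0)      # island 1 drops out periodically
    w = outer_round(w, islands, local_steps=25)
print(round(w, 3))  # converges near the target 3.0 despite intermittent failures
```

Note that communication happens once per outer round rather than once per step, which is the source of both the fault tolerance (a round simply proceeds without the missing island) and the bandwidth savings discussed above.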
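The bandwidth figures quoted in the FAQ can be sanity-checked with back-of-envelope arithmetic. The sketch below uses assumed numbers (2 bytes per parameter, a 1-second step time, a sync interval of 500 steps); none of these come from the report, but they show how shipping a parameter-sized payload every few hundred steps, instead of every step, moves the wide-area requirement from the hundreds of Gbps into the sub-Gbps range.

```python
# Back-of-envelope sketch (all numbers are illustrative assumptions) of why
# infrequent synchronization slashes wide-area bandwidth: average traffic
# scales as model size divided by the synchronization interval.
def wan_gbps(params_billions, bytes_per_param, step_time_s, steps_per_sync):
    """Average WAN bandwidth (Gbps) to ship one parameter-sized payload
    every `steps_per_sync` optimizer steps."""
    bits = params_billions * 1e9 * bytes_per_param * 8
    return bits / (step_time_s * steps_per_sync) / 1e9

model_b = 12    # 12B-parameter model, as in the Gemma experiment
step_s = 1.0    # assumed 1 s per training step (illustrative)

every_step = wan_gbps(model_b, bytes_per_param=2, step_time_s=step_s,
                      steps_per_sync=1)      # sync after every step
diloco_like = wan_gbps(model_b, bytes_per_param=2, step_time_s=step_s,
                       steps_per_sync=500)   # sync every 500 steps
print(f"{every_step:.0f} Gbps vs {diloco_like:.2f} Gbps")  # 192 Gbps vs 0.38 Gbps
```

Under these assumptions the per-step figure lands near the ~200 Gbps cited for conventional training, and the infrequent-sync figure falls below 1 Gbps, consistent with the claim that ordinary inter-datacenter connectivity suffices.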

