As organizations lean into AI and machine learning on Google Cloud, GPU- and TPU-powered workloads are becoming mission critical. There is an important nuance in how GCP maintains its infrastructure: live migration. While standard VMs migrate seamlessly during maintenance, GPU and TPU VMs don't get the same treatment, and that matters.
Tools for the Job
Both GPUs and TPUs are critical for AI/ML pipelines: long-running training jobs, fine-tuning foundation models, and real-time inference services. Organizations' apps and services lean on this hardware heavily.
AI/ML Workloads
These are unlike traditional workloads. They often need specialized hardware and low latency, and they run for hours or even days without interruption. Because they rely on GPU/TPU memory state and tight hardware integration, they are far more sensitive to disruption and cannot accommodate seamless migration.
Imagine, when moving house, trying to get a grand piano through a small doorway, or a giant box that is bigger than the door frame out of the house. It's not practical.
GPU
Graphics Processing Units are used for deep learning, training large models, and serving high-volume inference requests. They are designed for massive parallelism.
TPU
Tensor Processing Units are custom Google silicon optimized for TensorFlow and other large-scale ML and AI workloads.
Live Migration Limitation
GPUs and TPUs are not live migrated. For GPU instances, GCP will usually give a 60-minute advance notice; for AI Hypercomputer clusters, only about 10 minutes.
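One practical consequence: a workload can watch for that notice itself. A minimal sketch, assuming a VM running on Compute Engine, where the metadata server exposes a `maintenance-event` value (the event names below, such as `TERMINATE_ON_HOST_MAINTENANCE`, are GCE metadata values; treat the exact strings as an assumption to verify against the docs):

```python
import urllib.request

# GCE metadata endpoint that reports upcoming host maintenance.
# Only reachable from inside a Compute Engine VM.
MAINTENANCE_URL = (
    "http://metadata.google.internal/computeMetadata/v1/"
    "instance/maintenance-event"
)


def fetch_maintenance_event(url: str = MAINTENANCE_URL) -> str:
    """Return the current maintenance-event value for this VM."""
    req = urllib.request.Request(url, headers={"Metadata-Flavor": "Google"})
    with urllib.request.urlopen(req, timeout=5) as resp:
        return resp.read().decode().strip()


def should_prepare_for_shutdown(event: str) -> bool:
    """GPU/TPU VMs terminate rather than migrate, so any value
    other than NONE means it is time to checkpoint and drain."""
    return event != "NONE"
```

A job could poll this in a background thread and, when `should_prepare_for_shutdown` flips to `True`, use the notice window to flush state before the host takes the VM down.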
TPUs are also impacted during maintenance: the nodes are stopped and then restarted.
3 Reasons Why
Tight integration with hardware: While CPU memory can be serialized and streamed relatively easily, GPU and TPU state is tied directly to hardware, such as GPU memory and tensor cores. That can't simply be lifted and shifted.
State Transfer Complexity: ML workloads running on GPUs/TPUs modify memory at extremely high rates. Capturing and syncing this state without disrupting the workload is not realistic as of this writing.
Driver and kernel dependency: GPU/TPU drivers are deeply coupled to the host OS and hardware, making seamless migration far more complex than with standard CPUs.
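To see why state transfer (the second reason) is impractical, a rough back-of-the-envelope calculation helps. The numbers below are illustrative assumptions, not measured figures: an accelerator with 80 GB of memory and roughly 2 TB/s of memory bandwidth, migrated over a 100 Gbps link.

```python
# Back-of-the-envelope: can pre-copy migration ever catch up?
HBM_BYTES = 80 * 10**9       # assumed accelerator memory: 80 GB
WRITE_RATE = 2 * 10**12      # assumed memory churn: ~2 TB/s
LINK_RATE = 100 * 10**9 / 8  # assumed 100 Gbps link, in bytes/s

# Time to copy the full memory once over the network.
first_pass_s = HBM_BYTES / LINK_RATE  # 6.4 seconds

# Memory dirtied by the running workload while that copy was in flight.
dirtied_bytes = WRITE_RATE * first_pass_s  # 12.8 TB, far more than we sent

# Pre-copy converges only if the link outruns the dirty rate.
converges = LINK_RATE > WRITE_RATE  # False by a factor of ~160
```

Under these assumptions, each pass over the network leaves more dirty memory behind than it transferred, so the copy never converges. The only way out is to pause the workload entirely, which defeats the purpose of live migration.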
In the next and final post, I will cover what I learned about resilience strategies that mitigate these limitations, so AI workloads can stay available even during GCP's planned maintenance.