Live Migration on the GCP

When it comes to compute resources, there are times when Google's hardware needs maintenance or patching, or the hardware itself fails. Live migration moves the VM to another physical server, often without anyone noticing.

How it works

Looking down from 30,000 feet, there are three major steps in the process.

Source Brownout: During the source brownout, the VM on the source host remains active and continues to serve traffic. Compute Engine copies and streams most of the VM's state to another machine, the target host. Any changes made during this time are tracked for syncing later. Due to the overhead of copying memory and tracking changes, there could be a slight degradation in performance.

Think about when you're moving from one house to another. There are boxes being packed and moved, but you can still cook dinner and do laundry in the old place. You keep track of what got left behind so that you can move it at the appropriate time.
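
To make the copy-and-track idea concrete, here is a minimal sketch in Python. It is a toy model, not Compute Engine's actual implementation: memory is a plain dictionary of pages, and the function name `source_brownout` and its parameters are made up for illustration.

```python
def source_brownout(source_memory, guest_writes):
    """Toy model of the first phase (hypothetical names, not a real API).

    source_memory: dict mapping page number -> contents of the running VM
    guest_writes:  list of (page, contents) writes the VM makes after the
                   bulk copy has already captured that page
    Returns the target's copy of memory and the set of pages to re-send.
    """
    # Bulk copy: stream every page to the target while the VM keeps running.
    target_memory = dict(source_memory)

    # The VM is still live on the source, so some pages change after being
    # copied. Record them so they can be synced in the next phase.
    dirty_pages = set()
    for page, contents in guest_writes:
        source_memory[page] = contents
        dirty_pages.add(page)

    return target_memory, dirty_pages


source = {0: "kernel", 1: "heap-v1", 2: "stack-v1"}
target, dirty = source_brownout(source, [(1, "heap-v2"), (2, "stack-v2")])
print(sorted(dirty))  # pages that still need to be re-sent
print(target[1])      # the target still holds the older copy of page 1
```

At the end of this phase the target has most of the state, plus a list of what is stale, much like the list of things left behind at the old house.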

Blackout: This is when the VM is paused briefly and any remaining state is sent to the target host. This is mostly memory that was written to after it was initially copied during the first phase, the source brownout. Once the transfer is complete, the VM resumes running on the target host. Again, slight degradation may occur from copying memory and tracking changes. The pause is usually less than 1 second, but can be up to 5 seconds in rare cases. The system clock may appear to jump forward by up to 5 seconds to account for the pause.

This would be like unplugging a lamp in one room, plugging it back in in another, and having it turn on with the exact same light output.
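
The blackout step can be sketched the same way. This is a simplified model under the assumption that memory is a dictionary of pages and that the set of changed pages is already known; the function name `blackout` is hypothetical and does not correspond to any Google API.

```python
def blackout(source_memory, target_memory, dirty_pages):
    """Toy model of the pause: re-send only the pages that went stale.

    The VM is assumed to be paused while this runs, so no new writes can
    arrive and the target ends up with an exact copy of the source.
    """
    for page in dirty_pages:
        target_memory[page] = source_memory[page]
    return target_memory


# State at the moment of the pause: pages 1 and 2 changed after the bulk copy.
source = {0: "kernel", 1: "heap-v2", 2: "stack-v2"}
target = {0: "kernel", 1: "heap-v1", 2: "stack-v1"}

blackout(source, target, dirty_pages={1, 2})
print(target == source)  # the two copies now match
```

Because only the stale pages are sent, the pause depends on how much memory changed recently rather than on the total size of the VM, which is one reason the pause can stay short.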

Target Brownout: At this stage the VM is running on the new machine (the target host). The source host may provide temporary support, such as forwarding network traffic, while the network fabric catches up to the new location of the VM. There is minimal disruption, and most users and applications won't notice any changes.

One way to think of this is settling into your new house while the postal service forwards your mail to the new address for a short time.
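
The forwarding behavior can also be sketched as a toy simulation. Everything here is an illustrative assumption: the `Host` class, the `fabric_route` variable and the packet names are invented for this example and say nothing about how Google's network fabric is really built.

```python
class Host:
    """A physical machine that either delivers packets locally or forwards them."""

    def __init__(self, name):
        self.name = name
        self.forward_to = None  # set while this host is only a relay
        self.delivered = []     # packets that reached a VM on this host

    def receive(self, packet):
        if self.forward_to is not None:
            self.forward_to.receive(packet)  # mail forwarding to the new address
        else:
            self.delivered.append(packet)


def simulate_target_brownout():
    source, target = Host("source"), Host("target")

    # The VM now runs on the target, but the fabric still routes to the source.
    source.forward_to = target
    fabric_route = source
    fabric_route.receive("packet-1")  # arrives at the old host, gets forwarded

    # The fabric catches up, so forwarding is no longer needed.
    fabric_route = target
    source.forward_to = None
    fabric_route.receive("packet-2")  # arrives at the new host directly

    return target.delivered


print(simulate_target_brownout())
```

Both packets end up at the target host, which is why the sender does not notice that the VM moved partway through.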

In the next post I go into some detail on AI workloads and the limitations of live migration with GPUs and TPUs.