The outage hit instances in every region, which made it hard for clients to mitigate its impact.
Google is reimbursing Google Compute Engine users up to 25 percent of their monthly charges after an outage that impacted instances across all regions on Monday.
The outage lasted 18 minutes and did not affect Google App Engine, Google Cloud Storage, or other Google Cloud Platform products. While 18 minutes may not sound like a lot of time, in the cloud world it is. And because the outage hit every region, clients couldn’t fail over to another region to mitigate the impact.
According to a lengthy and apologetic postmortem published Wednesday on the Google Cloud Platform status page, the issue began when engineers removed an unused GCE IP block from the network configuration and instructed Google’s systems to propagate the new configuration across the network.
“By itself, this sort of change was harmless and had been performed previously without incident. However, on this occasion our network configuration management software detected an inconsistency in the newly supplied configuration,” Google VP of engineering Benjamin Treynor Sloss said.
“In attempting to resolve this inconsistency the network management software is designed to ‘fail safe’ and revert to its current configuration rather than proceeding with the new configuration. However, in this instance a previously-unseen software bug was triggered, and instead of retaining the previous known good configuration, the management software instead removed all GCE IP blocks from the new configuration and began to push this new, incomplete configuration to the network.”
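The “fail safe” behavior Treynor Sloss describes can be illustrated with a minimal Python sketch. It is not Google’s software: the function names `validate` and `choose_config`, the overlap check standing in for the unspecified “inconsistency,” and the sample IP blocks are all assumptions made for illustration. The idea is only that a proposed configuration is accepted if it passes validation, and otherwise the current known-good configuration is kept unchanged.

```python
from ipaddress import ip_network


def validate(blocks):
    """Return True if the block list is non-empty and no two blocks overlap.

    Hypothetical consistency check; the real check Google ran is not public.
    """
    nets = [ip_network(b) for b in blocks]  # raises ValueError on bad input
    if not nets:
        return False
    for i, a in enumerate(nets):
        for b in nets[i + 1:]:
            if a.overlaps(b):
                return False
    return True


def choose_config(current, proposed):
    """Fail safe: keep the current config unless the proposed one validates."""
    try:
        ok = validate(proposed)
    except ValueError:
        ok = False
    # On any inconsistency, return the known-good config untouched
    # rather than a partially edited version of the proposed one.
    return list(proposed) if ok else list(current)


# Illustrative blocks only, not real GCE ranges.
current = ["10.0.0.0/16", "10.1.0.0/16", "10.2.0.0/16"]
removed_one = ["10.0.0.0/16", "10.1.0.0/16"]
inconsistent = ["10.0.0.0/16", "10.0.1.0/24"]

active_after_removal = choose_config(current, removed_one)
active_after_bad_push = choose_config(current, inconsistent)
active_after_empty_push = choose_config(current, [])
```

In this sketch, removing one block yields the new two-block configuration, while the overlapping and empty proposals both leave the original three blocks in place. The bug Google describes did the opposite of that last case: it pushed a configuration with every block removed.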
For the specifics and the timeline of the outage, see Google’s postmortem.
According to Google, its engineering teams will be working over the next several weeks on a “broad array of prevention, detection and mitigation systems intended to add additional defense.”
“It is our intent to enumerate all the lessons we can learn from this event, and then to implement all of the changes which appear useful,” Treynor Sloss said, noting that “there are already 14 distinct engineering changes planned spanning prevention, detection and mitigation.”