Several online services went down on Thursday evening due to an error at Google Cloud.
The outage took down parts of Cloudflare, which in turn knocked our own website offline. Prominent services affected to varying degrees included Google’s own services, Spotify, and Discord, among others. Services that did not depend on Google or Cloudflare did not appear to experience any problems.
Problems, but no problems
Google confirmed that it was experiencing problems and said that a root cause analysis would follow later.
A Cloudflare spokesperson stressed to Techzine that this was not a Cloudflare outage. Instead, the CDN provider described it as a “Google Cloud outage.” “A limited number of services at Cloudflare use Google Cloud and were impacted. The core Cloudflare services were not impacted.”
Waiting for answers
At the time of writing (11:50 p.m.), Google Cloud’s status page is littered with warning signs, indicating ongoing problems. However, Cloudflare states that its own services are largely back up and running. The fact that you can read this article suggests that this is correct.
The exact cause remains to be seen, but Google has a good reputation for explaining its outages. A prolonged outage in January 2022 was followed by a detailed explanation, and we may hear something similar from Google later. For now, however, that is speculation.
In that incident, routine maintenance of an SDN component led to an unexpected error within Google Cloud. According to Google, the maintenance triggered an application failover, in which a new active replica was brought up from an earlier checkpoint. Normally, this happens without any problems, but this time the replica was missing a critical piece of configuration information. The error spread to roughly 15 percent of the network switches serving the us-west1-b zone. Reprogramming those switches then triggered a race condition in the firmware, causing them to crash. The failure unfolded automatically, but recovery had to be done manually. In the end, Google Cloud was down for roughly 3.5 hours in the us-west1-b zone.
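To make the failure mode in the 2022 incident more concrete, here is a minimal Python sketch of a failover that restores a replica from a checkpoint taken before a required setting was added. It is purely illustrative and not Google's implementation: the `Controller` class, the `failover` function, and the configuration keys `routes` and `acl_policy` are all hypothetical names invented for this example.

```python
import copy

# Hypothetical set of settings a replica must have before it may go active.
REQUIRED_KEYS = {"routes", "acl_policy"}


class Controller:
    """Toy stand-in for a control-plane component holding configuration."""

    def __init__(self, config):
        self.config = dict(config)

    def checkpoint(self):
        # Snapshot the configuration as it exists right now.
        return copy.deepcopy(self.config)

    @classmethod
    def restore(cls, snapshot):
        # Bring up a new replica from a previously taken snapshot.
        return cls(snapshot)


def missing_keys(config):
    """Return the required settings that are absent from a configuration."""
    return sorted(REQUIRED_KEYS - set(config))


def failover(snapshot):
    """Restore a replica, refusing to activate it if settings are missing."""
    replica = Controller.restore(snapshot)
    missing = missing_keys(replica.config)
    if missing:
        raise RuntimeError(f"replica is missing configuration: {missing}")
    return replica


if __name__ == "__main__":
    active = Controller({"routes": ["10.0.0.0/8"]})
    stale = active.checkpoint()                       # taken early
    active.config["acl_policy"] = "deny-by-default"   # added after the checkpoint
    try:
        failover(stale)
    except RuntimeError as err:
        print("failover blocked:", err)
```

In this sketch, the check inside `failover` stops the stale replica before it can push incomplete state anywhere. Whether a comparable safeguard existed or would have helped in Google's case is not something the report summarized here tells us.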