A week after the main AWS region went down, Microsoft Azure also experienced a global outage. The downtime forced companies of all shapes and sizes to stop working or rely on alternatives. Apart from all the hassle this caused, monetary damages started piling up too. What are the costs for organizations that have embraced the public cloud, only to see it go offline? And how can you reduce those costs, if at all?
Workers around the world are increasingly affected by cloud outages, which are practically unavoidable. Because the IT problems are located in someone else’s data center, organizations are often unable to assess the exact damage or how long the problems will last. The outages also occur for different reasons. The AWS US-EAST-1 region was knocked out by a DNS error, Azure went offline last night due to a CDN problem, and two years ago, a fire at a Paris location led to three weeks of limited service for Google Cloud customers.
They all lead to roughly the same problem: downtime. Partial downtime, in which a single application goes offline, for example, is the more common kind. There are countless causes for it, but these are largely indistinguishable to the end user. Here, however, we are concerned with a more drastic impact, which you could call complete downtime: an organization is unable to perform its core tasks. Traditionally, this was caused by a failure in the organization’s own IT infrastructure or by an external factor such as a natural disaster. Nowadays, the cloud transition is already a fait accompli for many. There is no longer a “box” in the closet, only PCs as endpoints to access cloud services. For those organizations, the damage can be significant, as research shows.
Costs quickly spiral out of control
At the end of last year, Splunk researchers spoke of the ‘hidden costs’ of downtime. These costs are not so hidden, given that the findings are part of a long tradition of similar studies. In any case, the top 2,000 companies in the world together pay approximately $400 billion for downtime each year. A simple calculation ($400 billion divided by 2,000 companies) reveals that each of these organizations, which include the Dutch companies ASML, Nationale Nederlanden, AkzoNobel, Philips, and Randstad, loses around $200 million a year on average to unplanned downtime. Incidentally, what the Splunk study really brought to light were the knock-on financial effects of problems with security tools, infrastructure, and applications. These can wipe billions off market values.
The $200 million estimate focuses explicitly on the 2,000 richest companies in the world. Most organizations cannot afford similar damage. Concrete examples cited by Atlassian include a 12-hour Apple Store outage that cost $25 million and Facebook going offline for 14 hours in 2019. More recently, there was the CrowdStrike outage, which cost Fortune 500 companies an estimated $5.4 billion.
For a fairer picture of the average organization, we need to look elsewhere. Much of the research on this topic has been conducted by vendors. That does not mean we should dismiss the findings, but it does mean we should take a certain bias into account when citing the data. Take New Relic, a vendor of an observability platform. According to its research, organizations without full-stack observability lose about $2 million per hour to a ‘high business impact’ failure. What bothers us is that this business impact is never defined (when does an outage have a ‘high’ impact?). In any case, the figure only concerns companies surveyed by New Relic, so we can take the promise that full-stack observability can halve that $2 million with a grain of salt.
A more conservative estimate of downtime costs comes from Information Technology Intelligence Consulting (ITIC), which conducted research on behalf of Calyptix Security. The majority of the organizations surveyed had more than 200 employees, but the sample was more diverse than the top 2,000 companies worldwide. The costs of downtime were still substantial: at least $300,000 per hour for 90 percent of the companies in question. Forty-one percent stated that IT outages cost between $1 million and $5 million per hour.
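To make such hourly figures concrete, the rough sketch below multiplies an hourly cost by the length of an outage. The $300,000 figure is the lower bound from the Information Technology Intelligence Consulting study; the function name and the example durations are assumptions chosen purely for illustration, and a real estimate would also have to include recovery work and reputational damage.

```python
def outage_cost(hourly_cost_usd: float, duration_hours: float) -> float:
    """Estimate the direct cost of an outage as hourly cost times duration."""
    if hourly_cost_usd < 0 or duration_hours < 0:
        raise ValueError("cost and duration must be non-negative")
    return hourly_cost_usd * duration_hours


# Lower bound reported for 90 percent of surveyed companies (from the study).
LOWER_BOUND_USD_PER_HOUR = 300_000

# Hypothetical outage durations in hours, not taken from any real incident.
for hours in (1, 4, 12):
    cost = outage_cost(LOWER_BOUND_USD_PER_HOUR, hours)
    print(f"{hours:>2} h outage: ${cost:,.0f}")
```

Even at this lower bound, a single working day offline would run into the millions.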
What can you do about it?
For organizations that have made the move to the cloud, a general public cloud outage is equivalent to the complete downtime described above. For many companies, the costs are high enough to wipe out their profits in a matter of hours. This has been the case for some time and is a major problem even for small businesses. We could continue to dig up studies, but the figures vary each time depending on the methodology, the timing of the study, and the region. The point remains the same: costs are rising relentlessly. What can you do about it?
In theory, the largest companies can rely on a multicloud strategy. Hyperscalers also absorb many local outages by routing traffic to other regions. However, multicloud is not something that a start-up or SME can simply set up, and applications are rarely built in a fully redundant form across different clouds. Furthermore, it is quite possible that your own staff can continue to work while your product is inaccessible to customers. For example, countless sites went offline due to the AWS outage, and a large part of the internet goes down when Cloudflare has a problem.
Even so, it is important to store the most essential data in a location other than the public cloud. Furthermore, a perceived outage may be a problem unique to your own environment, possibly caused by a configuration error. That is why contacting the public cloud provider is a necessary step when problems arise. If you have suffered enormous financial damage, your SLA (Service Level Agreement) may entitle you to a refund afterwards. For example, AWS has three levels of API gateway services, each with a different minimum total availability.
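To get a feel for what a minimum availability in an SLA means in practice, the sketch below converts an availability percentage into the downtime a provider could incur per month while still meeting it. The percentages are hypothetical examples, not the actual AWS commitments, and the 30-day month and the function name are likewise assumptions; always check the SLA text of the service you use.

```python
HOURS_PER_30_DAY_MONTH = 30 * 24  # assumption: a 30-day billing month


def allowed_downtime_minutes(availability_percent: float,
                             period_hours: float = HOURS_PER_30_DAY_MONTH) -> float:
    """Return the downtime (in minutes) permitted by an availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100 percent")
    return period_hours * 60 * (1 - availability_percent / 100)


# Hypothetical availability targets, for illustration only.
for target in (99.0, 99.9, 99.95):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% availability allows about {minutes:.1f} minutes of downtime per month")
```

The takeaway is that an outage lasting several hours can exceed such a monthly budget many times over, which is when a claim under the SLA becomes relevant.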
Nevertheless, as an organization that embraces the cloud, you cannot simply escape the problems. Downtime is a fact of modern life due to the nature of the cloud. It does not even have to be the fault of the cloud provider or the customer organization. A partner or third party is sometimes the weak link, although that usually results in partial downtime; a malfunction on the scale of CrowdStrike is the exception. A problem with AWS, Azure, or Google Cloud is less common but, as recent weeks have shown, it does happen. That’s why you need to be prepared, especially financially, but also mentally. If not, every hour of cloud downtime can cost you dearly.