Cloud chaos boosts application resilience

Cloud environments excel at scalability, but that scalability brings risks. After all, no company can fully simulate the volatility of millions of users with a limited QA team. Chaos-driven infrastructure offers a solution.

We’re already familiar with Chaos Monkey, a tool Netflix deploys against its own services to introduce planned chaos and improve their resilience. It essentially stress-tests the technology by performing all sorts of demanding operations. After all, it is unthinkable for a company like Netflix to be down for any length of time because of a software problem. Hence the tool, which behaves like a monkey let loose in a data centre, tries everything under the sun to destabilize it. Only by remaining agile under that duress can an application truly be called resilient.
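
As a minimal sketch of the idea (and emphatically not Netflix’s actual implementation), a Chaos Monkey-style script can randomly terminate one instance from a pool that has opted in to chaos testing. The AWS/boto3 setup and the "chaos:enabled" tag below are assumptions for this example.

```python
# Minimal Chaos Monkey-style sketch: terminate one random opted-in instance.
# Assumes AWS credentials are configured and that instances which may be
# killed carry a (hypothetical) "chaos:enabled" tag.
import random

import boto3

ec2 = boto3.client("ec2")

def opted_in_instances():
    """Return IDs of running instances that are fair game for chaos tests."""
    resp = ec2.describe_instances(
        Filters=[
            {"Name": "tag:chaos:enabled", "Values": ["true"]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )
    return [
        inst["InstanceId"]
        for reservation in resp["Reservations"]
        for inst in reservation["Instances"]
    ]

def unleash_the_monkey():
    """Kill one instance at random; a resilient service shrugs it off."""
    candidates = opted_in_instances()
    if not candidates:
        return None
    victim = random.choice(candidates)
    ec2.terminate_instances(InstanceIds=[victim])
    return victim

if __name__ == "__main__":
    print("Terminated:", unleash_the_monkey())
```

The point is not the termination itself but what happens next: if users notice nothing, the service passed the test.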

Get rid of QA, embrace chaos

Lee Atchison is one of many who advocate forgoing traditional QA when developing cloud-native applications. He praises the dynamic nature of the cloud as a powerful tool for solving problems on the spot, in the background. Only then, he believes, is rapid innovation possible.

Indeed, chaos-driven infrastructure assumes a balance between chaos injection and automated problem solving. The end result: an application that stays up and running.
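
That balance can be made concrete with a toy simulation: chaos randomly kills a service, an automated watchdog restarts it, and the measured availability stays high. All numbers below are invented for illustration.

```python
# Toy model of chaos injection balanced by automated remediation.
import random

random.seed(42)  # reproducible demo

TICKS = 10_000          # discrete time steps
CRASH_CHANCE = 0.005    # chaos: ~0.5% chance per tick that the service dies
RESTART_CHANCE = 0.8    # remediation: a watchdog restart usually succeeds fast

service_up = True
downtime = 0

for _ in range(TICKS):
    if service_up and random.random() < CRASH_CHANCE:
        service_up = False          # chaos injection strikes
    if not service_up:
        downtime += 1
        if random.random() < RESTART_CHANCE:
            service_up = True       # automated problem solving kicks in

print(f"availability despite constant chaos: {1 - downtime / TICKS:.2%}")
```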

In 2020, researchers at the University of Potsdam published a scientific paper (PDF) advocating Risk-driven Fault Injection (RDFI) to secure cloud systems. To apply this methodology, they developed CloudStrike.
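
CloudStrike itself is considerably more involved, but the core idea of RDFI can be sketched: rank candidate security faults by an estimated risk score, inject the riskiest first, check whether detection fires, and always roll back. The fault names, scores, and hooks below are hypothetical placeholders, not CloudStrike’s API.

```python
# Sketch of risk-driven fault injection (RDFI): test the riskiest faults first.
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass(order=True)
class SecurityFault:
    risk_score: float                           # estimated impact x likelihood
    name: str = field(compare=False)
    inject: Callable[[], None] = field(compare=False)
    rollback: Callable[[], None] = field(compare=False)

def run_rdfi_campaign(faults: List[SecurityFault], detector) -> List[str]:
    """Inject faults in descending risk order; report those that go undetected."""
    undetected = []
    for fault in sorted(faults, reverse=True):
        fault.inject()
        try:
            if not detector(fault.name):
                undetected.append(fault.name)
        finally:
            fault.rollback()                    # always restore the secure state
    return undetected

# Dummy usage with no-op hooks and a detector that never fires:
faults = [
    SecurityFault(0.9, "make-storage-bucket-public", lambda: None, lambda: None),
    SecurityFault(0.4, "widen-iam-role-permissions", lambda: None, lambda: None),
]
print("undetected:", run_rdfi_campaign(faults, detector=lambda name: False))
```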

Principles

The ideas behind “chaos engineering” are laid out by Principles of Chaos on its website and rest on four main principles. First, it is essential to define a “steady state” in which a system behaves as expected. Next, one hypothesizes that this steady state persists both in systems where chaos is introduced and in a control group where it is not. Third, an engineer must introduce variables that mimic the real world: crashing servers, faltering hard drives, and broken Internet connections. Finally, one looks for a deviation from the steady state between the chaos-driven test group and the control group. The goal is obvious: to keep an application in its steady state as much as possible while firing chaotic variables at it.
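
A chaos experiment along these four principles can be skeletonized as below. The steady-state metric (median latency), the threshold, and all hooks are stand-ins invented for this sketch.

```python
# Skeleton of a chaos experiment following the four principles.
import statistics

def steady_state(samples):
    """Steady-state metric for this sketch: median request latency (ms)."""
    return statistics.median(samples)

def run_experiment(measure_control, measure_chaos, inject_fault, threshold_ms=50):
    baseline = steady_state(measure_control())       # 1. define the steady state
    # 2. hypothesis: the chaos group stays within threshold_ms of baseline
    inject_fault()                                   # 3. real-world variable
    observed = steady_state(measure_chaos())
    deviation = abs(observed - baseline)             # 4. look for a deviation
    return {"baseline_ms": baseline,
            "observed_ms": observed,
            "hypothesis_holds": deviation <= threshold_ms}

# Dummy usage; real runs would read live telemetry and hit real infrastructure.
print(run_experiment(
    measure_control=lambda: [98, 102, 100, 101],
    measure_chaos=lambda: [105, 110, 120, 108],
    inject_fault=lambda: print("crashing a server in the chaos group..."),
))
```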

Practice

Chaos as a tool is not new, but it is becoming increasingly relevant. Cloud adoption keeps rising, while numerous dependencies and multi-cloud environments add complexity. All of this invites plenty of challenges: the stress of gigantic user numbers accessing services at once, temporary failures in parts of the dependency chain, or existing vulnerabilities that cyber-attackers want to exploit.

So we end up with “chaos-based learning,” which draws on applications like Chaos Monkey and CloudStrike as well as on practical reality. Of course, it requires a good deal of code that developers must write to address the larger failure modes. The chaos approach also demands a different mindset than before, where the key is not to get the green light from QA but to actively look for weaknesses.
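
One example of the kind of defensive code this mindset produces: retrying a flaky dependency with jittered exponential backoff and degrading to a cached fallback rather than failing outright. The function names are placeholders for whatever dependency a service actually calls.

```python
# Retry with exponential backoff plus graceful fallback: a typical pattern
# that chaos testing forces into application code.
import random
import time

def call_with_backoff(fetch, fallback, attempts=4, base_delay=0.1):
    """Try `fetch`; retry with jittered exponential backoff; degrade to
    `fallback` instead of taking the whole application down."""
    for attempt in range(attempts):
        try:
            return fetch()
        except Exception:
            # Jitter avoids synchronized retry storms across instances.
            time.sleep(base_delay * (2 ** attempt) * random.uniform(0.5, 1.5))
    return fallback()

def flaky_fetch():
    """Stand-in dependency that fails most of the time."""
    if random.random() < 0.3:
        return {"status": "fresh"}
    raise RuntimeError("dependency unavailable")

print(call_with_backoff(flaky_fetch, fallback=lambda: {"status": "cached"}))
```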

This can therefore shorten the mean time to resolution (MTTR), because problems can be planned for. Real incidents seem to have the magical and terrible capacity to occur at the most inconvenient times: 3 a.m. on a Sunday, for example. Chaos tests, by contrast, can be scheduled precisely when support engineers are available.
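
Enforcing that schedule can be as simple as a time-window guard like the hypothetical one below; the weekday 09:00–17:00 window is an arbitrary choice for this sketch.

```python
# Only allow chaos experiments while support engineers are at their desks.
from datetime import datetime

def chaos_window_open(now=None) -> bool:
    now = now or datetime.now()
    on_weekday = now.weekday() < 5        # Monday=0 ... Friday=4
    in_office_hours = 9 <= now.hour < 17
    return on_weekday and in_office_hours

if chaos_window_open():
    print("window open: running scheduled chaos experiment")
else:
    print("window closed: a failure now would page on-call at 3 a.m.")
```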

Vaccine

Crisis management will always be with us. Still, the example of ex-Amazon development manager Kolton Andrus shows what an occasional outage can mean for a large company. In a 2019 presentation, he described a problem that took Amazon.com offline for at least 45 minutes, resulting in millions of dollars in lost sales. Significant incidents, he noted, are likely to occur precisely at times of high traffic, transforming a lucrative day into a missed opportunity.

Because systems are now constantly changing and unmanageably large, Andrus believes the old approach is no longer workable. Updates no longer arrive quarterly, but mostly weekly. “There is always something that doesn’t work quite right,” he argues. This applies not only to an individual player like Amazon, but to the entire Internet. The solution: chaos engineering, or “breaking things on purpose.”

For a team tuned in to chaos, these kinds of problems behave like a virus kept out by a vaccine, the same metaphor Andrus used in his presentation. After all, as an engineer you have already trained your systems to meet chaotic events with organized responses. Andrus also cites fire drills: no one can count on a safe and careful evacuation without being prepared for it. In a world where online environments are becoming essential in more and more sectors, we cannot work with systems that can “hopefully” be made operational again quickly in a crisis. For the reassurance of users and a workable job for software engineers, chaos is needed as a partner against catastrophe.

Also read: European open source world warns of Cyber Resilience Act