Chaos Engineering: A Science-based Approach to System Reliability

Tech leaders do everything in their power to ensure continuous delivery of excellent software to their customers.

The idea of purposely causing issues to a system in production may seem counter-intuitive, but as outages continue to wreak havoc at even the largest and most innovative companies today, it’s become increasingly clear that the best way to prevent disaster is to manufacture and test directly against it.

Unexpected outages already threaten companies, so why risk it? The short answer is that the rising practice of Chaos Engineering is currently the best path to system resilience. The proactive approach is gaining popularity with organizations everywhere as it’s an effective way to ensure that the system will continually deliver maximum uptime.

Since the beginning of the year, Microsoft Azure, Zoom, Twitter and IBM have all had outages that made the news, and these are just a few examples from 2020. An outage that lasts only an hour can take a massive toll on a business. According to the ITIC 2020 Global Server Hardware, Server OS Reliability Report, 98 percent of survey respondents said the cost of downtime for a single hour exceeds $150,000. Further, organizations that suffer outages risk customers losing trust in the brand or leaving for a competitor. All it takes is one outage, and the painful reality is that every organization is at risk.

What Exactly is Chaos Engineering?

While testing is standard practice in software development, it’s not always easy to foresee issues that can happen in production, especially as systems become increasingly complex in order to deliver maximum customer value. The adoption of microservices enables faster release times and more possibilities than we’ve ever seen before, but it also introduces challenges.

According to the 2020 IDG cloud computing survey, 92 percent of organizations’ IT environments are at least somewhat in the cloud today. In 2020, we saw highly accelerated digital transformation as organizations had to quickly adjust to the impact of a global pandemic.

With added complexity come more possible points of failure. The trouble is that the people managing these intricate systems cannot foresee every issue, because it is impossible to know in advance how all of the individual components of a loosely coupled architecture will interact with each other.

This is where Chaos Engineering steps in to proactively create resilience. The major caveat of Chaos Engineering is that things are broken in a very intentional and controlled manner while in production, unlike regular QA practices, where this is done in safe development environments. It is methodical and experimental and less ‘chaotic’ than the name implies.

When any system grows more complex through digital transformation, tech leaders need to consider the potential weaknesses that people cannot reasonably anticipate. No amount of pre-production testing can foresee every potential issue, so breaking things on purpose in production is an increasingly adopted practice. The official definition of Chaos Engineering is:

Chaos engineering is the discipline of experimenting on a software system in production in order to build confidence in the system’s capability to withstand turbulent and unexpected conditions.

Where Did Chaos Engineering Come From?

Chaos Engineering originated at Netflix. Gremlin’s detailed history of Chaos Engineering explains Chaos Monkey, Netflix’s resiliency tool that was created as they moved from a physical movie rental service to a streaming service.

When they moved to the new environment and became reliant upon Amazon Web Services, Netflix learned that hosts could be terminated and replaced at any time. Netflix needed its systems to be resilient and able to respond to failure, and determined that the best way to achieve this was through real-world scenarios. According to Gremlin’s summary, Chaos Monkey was created to “test system stability by enforcing failures via the pseudo-random termination of instances and services within Netflix’s architecture.” This tool marked the beginning of what eventually evolved into the accepted practice of Chaos Engineering.

How Does It Work?

Chaos Engineering is experimental and should be approached scientifically. The discipline has a manifesto of sorts, which identifies four steps that experiments should follow:

1. Define your ‘steady state’ (a measurable output of a system that indicates normal behavior).

2. Make a hypothesis. What do you think will happen?

3. Introduce variables that reflect real-world events: crashed servers, hard drives that malfunction, severed network connections, etc.

4. Try to disprove the hypothesis.
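
The four steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `ToyService` is a hypothetical stand-in for a real system (replicas behind a load balancer that retries on another replica), the 99 percent success threshold is an invented steady-state definition, and the "failure" is simulated by flipping a flag rather than terminating a real host.

```python
import random
from dataclasses import dataclass, field


@dataclass
class ToyService:
    """Hypothetical system under test: replicas behind a retrying load balancer."""
    replicas: dict = field(default_factory=lambda: {"a": True, "b": True, "c": True})
    max_attempts: int = 2  # how many distinct replicas one request may try

    def handle_request(self, rng: random.Random) -> bool:
        # Pick distinct replicas at random; the request succeeds if any is healthy.
        attempts = min(self.max_attempts, len(self.replicas))
        candidates = rng.sample(sorted(self.replicas), k=attempts)
        return any(self.replicas[name] for name in candidates)


def success_rate(service: ToyService, rng: random.Random, requests: int = 2000) -> float:
    """Measurable output used as the steady-state signal."""
    served = sum(service.handle_request(rng) for _ in range(requests))
    return served / requests


def run_experiment(service: ToyService, victim: str,
                   threshold: float = 0.99, seed: int = 0) -> dict:
    rng = random.Random(seed)

    # Step 1: define the steady state by measuring normal behavior.
    baseline = success_rate(service, rng)

    # Step 2: hypothesis: the success rate stays at or above `threshold`
    # even with one replica down.

    # Step 3: introduce a real-world event (here, a simulated crashed replica).
    service.replicas[victim] = False
    try:
        observed = success_rate(service, rng)
    finally:
        service.replicas[victim] = True  # always roll the fault back

    # Step 4: try to disprove the hypothesis by comparing against the threshold.
    return {
        "baseline": baseline,
        "observed": observed,
        "hypothesis_held": observed >= threshold,
    }


if __name__ == "__main__":
    print(run_experiment(ToyService(), victim="a"))
    print(run_experiment(ToyService(max_attempts=1), victim="a"))
```

With retries enabled, the hypothesis survives the loss of one replica. With `max_attempts=1`, roughly a third of requests land on the dead replica and the hypothesis is disproved, which is exactly the kind of weakness an experiment is meant to surface.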

Further, these experiments should be run in production to produce the most authentic results, and ideally they should be automated. The key to not breaking the entire system is to “minimize the blast radius” by keeping the experiment small at first and scientifically executing a well-thought-out plan. Chaos Engineers should determine the potential reach: how many customers could be affected, and at which locations, if the experiment results in an outage. Keeping the impact on the system minimal and having a team standing by for incident response limits the inconvenience to customers.
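
One way to picture "minimizing the blast radius" is as a planning step that runs before any fault is injected. The sketch below is an assumption-laden illustration, not a prescribed method: the inventory format (`name`, `region`, `customers`), the 5 percent host budget, the customer cap and the 2 percent abort threshold are all hypothetical values chosen for the example.

```python
import random
from collections import Counter


def plan_blast_radius(hosts, fraction=0.05, max_customers=100, seed=None):
    """Choose a small random set of hosts and report the potential reach.

    `hosts` is a hypothetical inventory: a list of dicts with the keys
    "name", "region" and "customers".
    Returns (chosen host names, customers potentially affected per region).
    """
    rng = random.Random(seed)
    budget = max(1, int(len(hosts) * fraction))  # start small: at least one host
    chosen = []
    reach = Counter()
    for host in rng.sample(hosts, k=len(hosts)):  # random order, no repeats
        if len(chosen) == budget:
            break
        # Skip any host that would push the total reach past the cap.
        if sum(reach.values()) + host["customers"] <= max_customers:
            chosen.append(host["name"])
            reach[host["region"]] += host["customers"]
    return chosen, dict(reach)


def should_abort(error_rate, abort_threshold=0.02):
    """Guardrail for the standby team: stop if errors exceed the threshold."""
    return error_rate > abort_threshold


if __name__ == "__main__":
    inventory = [
        {"name": f"host-{i}", "region": "us" if i % 2 else "eu",
         "customers": 10 * (i % 5 + 1)}
        for i in range(40)
    ]
    targets, reach = plan_blast_radius(inventory, max_customers=60, seed=1)
    print("targets:", targets)
    print("potential reach by region:", reach)
```

If an experiment planned this way holds up, the fraction and the customer cap can be raised gradually on later runs.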

Through the process, tech leaders can then determine whether they want to further scale an experiment or scrap it and begin the work to resolve any weaknesses that have been identified.

About the Author

Lev Shur, President of Exadel Solutions at Exadel. For more than 20 years, Exadel has been delivering Digital Transformation services, enterprise and custom software solutions for Fortune 500 clients, including HPE, Deloitte, Home Depot and McKesson. With 20+ locations and delivery centers across the US and Europe, Exadel solves the most complex engineering problems using Agile methodologies,

Featured image:  ©LightField Studios