Why should you intentionally break your perfectly working servers often?

A quick introduction to Chaos Engineering

Dharani Sowndharya
5 min read · Mar 3, 2021

You are working on a very critical release a few days before your much-awaited vacation. Surprisingly, everything goes according to plan. That junior developer whom you thought would definitely mess up didn't. You've checked everything that could go wrong in your system, and you leave early that day, looking forward to a fun-filled trip with your friends.

Just when you are packing, you get a call from your Ops team. You immediately know that there is some issue. You take the call with a groan and hear that the team is facing downtime because a region of the cloud provider that hosts your servers has failed. One of the cloud services is down, and every service that depends on it is showing the dreaded red status on the provider's status page.

You realize with panic that, even though you tested for high availability, you never checked for high availability at the scale of an entire region.

As a systems administrator or a lead, does this scenario sound familiar? Often, the system goes down for a completely new reason that you've never come across before, and you are put in a situation where you have to be quick on your feet and cool in your head to learn on the spot and fix it.

Chaos engineering is a practice that helps you prepare for situations like this.

As per the Principles of Chaos Engineering,

Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production.

Google, Microsoft, Facebook, and Amazon have been using some variation of this setup to test their resiliency since the early 2000s.

Even though the concept has been around in the tech industry for a long time, it was Netflix that introduced the discipline of chaos engineering to the wider world. To safeguard their multi-million-dollar business, which was completely dependent on and operated in the cloud, they introduced a tool named Chaos Monkey all the way back in 2011. It is part of a suite of tools named the Simian Army (now deprecated), each of which can cause a different level of destruction to your infrastructure setup. Some of them are:

Chaos Monkey — Terminates random servers in the infrastructure
Chaos Gorilla — Simulates disabling a full Availability Zone / data center
Chaos Kong — Simulates disabling a complete region
Latency Monkey — Introduces the kind of communication delays that we commonly see during outages

They named the tools after monkeys because they wanted to simulate the kind of destruction a real monkey could cause in a data center by pulling out cables and destroying everything in sight.

As recognition for the discipline grew, a lot of other vendors came into the picture to provide this as a service.

Some of the products in the market are Gremlin, Chaos Toolkit, Toxiproxy, Pumba, etc.

A practice named Game Day is often followed, in which teams test their resiliency to prove a hypothesis about the behaviour of their system. In this practice, the teams and the respective stakeholders sit together to execute disruptive experiments in a safe and controlled manner. They can then work on their recovery procedures and validate their system's performance.

Game days need not be very complicated. You can conduct a game day to check whether the Ops team can handle issues using the steps provided in your runbook, to see whether all your alerts fire as expected, or to run an experiment that verifies your DNS service can route to a different region during a simulated regional failover.

You need not depend on external vendors to conduct this. You can create your own chaos engineering setup. For example, if you are using AWS, you can write a boto3 script that shuts down instances of an Auto Scaling group at regular intervals while a load test is running, and monitor how that affects your system.
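Here is a minimal sketch of what such a script could look like, assuming an Auto Scaling group name, a fixed interval, and a fixed number of rounds as a crude kill switch; these values are illustrative, not prescriptive, and it is safest to try this outside production first.

    import random
    import time

    import boto3

    ASG_NAME = "my-web-asg"      # hypothetical Auto Scaling group name
    INTERVAL_SECONDS = 300       # how often to inject a failure
    EXPERIMENT_ROUNDS = 5        # crude kill switch: stop after a fixed number of rounds

    autoscaling = boto3.client("autoscaling")

    def pick_random_instance(asg_name):
        """Return the ID of a random in-service instance in the given ASG."""
        response = autoscaling.describe_auto_scaling_groups(
            AutoScalingGroupNames=[asg_name]
        )
        instances = response["AutoScalingGroups"][0]["Instances"]
        in_service = [i["InstanceId"] for i in instances
                      if i["LifecycleState"] == "InService"]
        return random.choice(in_service) if in_service else None

    for _ in range(EXPERIMENT_ROUNDS):
        instance_id = pick_random_instance(ASG_NAME)
        if instance_id:
            print(f"Terminating {instance_id} from {ASG_NAME}")
            # Let the ASG replace the instance, so desired capacity stays the same
            autoscaling.terminate_instance_in_auto_scaling_group(
                InstanceId=instance_id,
                ShouldDecrementDesiredCapacity=False,
            )
        time.sleep(INTERVAL_SECONDS)

Run your load test in parallel and watch your dashboards: if the setup is resilient, the Auto Scaling group should replace each terminated instance and your users should notice nothing.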

As per the Principles of Chaos Engineering (https://principlesofchaos.org), there are a few steps that you have to keep in mind before starting with your own chaos setup:

  1. There should be a clear definition of the steady state you wish to maintain, and you have to define it before starting your experiment. In simple terms, it could be that you want your website to always return a 200 status despite failures in certain targeted servers. Such resiliency can be achieved with a simple setup of multiple instances of your server in an Auto Scaling group spanning multiple Availability Zones, and you can check whether this setup lets you hold that steady state (see the sketch after this list).
  2. Consider and note the measurable outputs of the system that you are going to monitor. It can be your latency, error rate, throughput, etc. This gives a clear picture of how your system is affected by the experiments.
  3. Simulate issues that closely resemble real-life events. Bring to mind the last 5 or 6 real incidents that you faced. Maybe an SSL certificate expired, there was a sudden increase in users, a surge in memory or CPU usage, or the cloud provider had an outage that made your setup unstable. Use these scenarios and try to simulate them either with existing tools or with your own code.
  4. If you are conducting this in production, make sure that the destruction you introduce can be contained, for example with a kill switch that stops the experiment after a certain duration or beyond a certain scope. Make sure there is always a recovery plan to mitigate issues caused by the experiment.
  5. If possible, automate the whole process.
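As a rough illustration of the first two points, the steady-state check can be as simple as a script that polls your endpoint while the experiment runs and records status codes and latency. The URL, poll interval, and duration below are assumptions for illustration only.

    import time
    import urllib.error
    import urllib.request

    URL = "https://example.com/health"   # hypothetical health-check endpoint
    DURATION_SECONDS = 600               # how long to observe the steady state
    POLL_INTERVAL = 5                    # seconds between checks

    results = []
    deadline = time.time() + DURATION_SECONDS

    while time.time() < deadline:
        start = time.time()
        try:
            with urllib.request.urlopen(URL, timeout=10) as response:
                status = response.status
        except urllib.error.HTTPError as err:
            status = err.code            # server responded, but with an error status
        except urllib.error.URLError:
            status = None                # request failed entirely
        latency_ms = (time.time() - start) * 1000
        results.append((status, latency_ms))
        time.sleep(POLL_INTERVAL)

    total = len(results)
    ok = sum(1 for status, _ in results if status == 200)
    avg_latency = sum(latency for _, latency in results) / total

    print(f"Checks: {total}, 200 responses: {ok} ({ok / total:.1%})")
    print(f"Average latency: {avg_latency:.0f} ms")

If the share of 200 responses and the latency stay close to what you measured before the experiment, your steady state held; if not, you have found something worth investigating.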

This process will teach you more about your system. You could be running a load test and discover that the application unexpectedly crashes when the number of instances reaches a certain limit. That is something you wouldn't have expected, and scenarios like this prompt us to understand more about our systems and why certain issues occur.

End the whole exercise with a retrospective to understand what went well and what could have gone better. Maybe it is better communication with the stakeholders, rewriting your runbook to cover incidents in more detail, or simply making changes to add more resources to your system.

Is there someone on your team, maybe that junior developer, who has a flair for breaking stuff? You can assign them this job: intentionally break your servers to their heart's content, and learn as a team from the observations about your system's behaviour.

— — — — — — — —

References:

https://principlesofchaos.org

Inspired by discussions with: Siva S, Distinguished Engineer, Trimble Inc, Chennai
