In this blog, we will learn about chaos engineering concepts and some of the tools that support it.
WHAT IS CHAOS ENGINEERING
Chaos engineering is the discipline of experimenting on a system in order to build confidence in the system’s capability to withstand turbulent conditions in production.
In simple words, it is a set of thoughtful, planned experiments designed to reveal the weaknesses in a system.
You start from a question about how the system might fail, then plan and run an experiment around it.
That way you can reveal weaknesses in your system, whether they lie in hardware or in software.
WHY DO WE NEED CHAOS ENGINEERING
- Nowadays, most companies are moving toward a microservice architecture because it provides flexibility.
- These architectures are complex and distributed, with countless connections and dependencies between services.
- It is practically very difficult to test everything with traditional QA practices.
- So you want a mechanism that lets you test these things proactively.
- That’s why we need chaos engineering: to find issues in large distributed systems.
PRINCIPLES OF CHAOS ENGINEERING (How it works)
1. DEFINE STEADY-STATE
Steady state is the normal behavior of a system over time. What does normal behavior look like?
Example– As a user of a shopping website, I can track my order. When that works as expected, the application is in its steady state.
To recognize the steady state, you need a few system metrics and business metrics.
System metrics come from the system itself, such as the operating system or your application: latency, for example.
Business metrics are the ones the business cares about, such as the number of logins per minute during peak hours or the number of failed logins per minute.
2. FORM YOUR HYPOTHESIS
State the hypothesis as a “what if” question with a measurable expected outcome.
Example 1– If a random Virtual Machine is terminated, THEN failures are negligible (fewer than 10 failed logins per 10,000 logins).
Here, “negligible” is defined precisely: fewer than 10 failed logins per 10,000 logins.
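A hypothesis with a numeric threshold can be checked mechanically. The following is a minimal sketch, not part of any tool: `hypothesis_holds` is a hypothetical helper name, and the counts of failed and total logins are assumed to come from your own metrics.

```shell
# Hypothetical helper: succeeds (exit status 0) when failed logins stay
# below the limit, expressed as failures per 10,000 logins (default: 10).
hypothesis_holds() {
  local failed="$1"
  local total="$2"
  local limit="${3:-10}"
  # No logins means no data, so the hypothesis cannot be confirmed.
  [ "$total" -gt 0 ] || return 1
  # Cross-multiply to stay in integer math: failed/total < limit/10000
  [ $((failed * 10000)) -lt $((limit * total)) ]
}
```

For example, `hypothesis_holds 7 10000 && echo "hypothesis holds"` prints the message, while 10 failures out of 10,000 would not, because the threshold is strictly "fewer than 10".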
3. PLAN AND RUN YOUR EXPERIMENT
Now that we know the steady state and have built a hypothesis, we have to plan and run the experiment. For planning, gather all the stakeholders, look at the architecture of your application, and then contain the Blast Radius.
What is Blast Radius?
Example- Let’s say we have 50 Virtual Machines in production. You don’t want to inject chaos or failure into all 50 VMs, because your customers would suffer.
So you decide to inject the failure into only one VM. That is how you contain the Blast Radius: you define the scope of the experiment as 1, 2, or 3 systems, not all of them.
When you plan this kind of exercise, notify your organization that you are going to run a chaos experiment, so that people are ready if something goes down.
Have a stop button ready, so that you can abort the test if something goes wrong.
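The two safety ideas above, a small Blast Radius and a stop button, can be sketched in a few lines of shell. Both function names are hypothetical, the VM names are made up, and `shuf` is assumed to be available (it ships with GNU coreutils).

```shell
# Hypothetical helper: read candidate targets from stdin (one per line)
# and print COUNT of them chosen at random (default: 1).
pick_targets() {
  local count="${1:-1}"
  shuf -n "$count"
}

# Hypothetical stop condition: succeeds when the observed error count
# exceeds the error budget agreed on before the experiment.
should_abort() {
  local errors="$1"
  local budget="$2"
  [ "$errors" -gt "$budget" ]
}
```

For example, `seq 1 50 | sed 's/^/vm-/' | pick_targets 1` prints one randomly chosen name out of 50, and `should_abort 12 10 && echo "abort the experiment"` signals that the test should stop.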
4. MEASURE AND LEARN
Use your metrics to prove or disprove the hypothesis. Ask questions like these:
- Was the system resilient to the injected failure?
- Did anything unexpected happen?
- Share your progress and successes with the rest of the organization.
In summary
- You define a steady state
- You create a hypothesis
- You run an experiment (if it succeeds, scale up and repeat)
- If the experiment fails, find and fix the issue, or abort if your abort conditions are met
- Don’t forget to minimize the Blast Radius
How chaos engineering is different from testing
Testing is a kind of validation: you know the input and the expected output, and you validate that they match.
Chaos engineering is about experimentation: you form a hypothesis and then run experiments whose outcome is not known in advance.
Example- In a microservice architecture, service A talks to service B, service C, and so on. Your hypothesis is your assumption about how a failure in service A affects service B, and you plan experiments to prove or disprove it. That is how chaos engineering differs from testing.
SOME TOOLS WHICH ARE ON THE MARKET
- CHAOS MONKEY- Netflix built a suite of tools called the SIMIAN ARMY to inject failures into its production environment. Chaos Monkey was the first of these tools. Netflix created it to test its applications in production on AWS and build confidence in its systems. Chaos Monkey randomly terminates instances that run a service so that you can observe how the rest of the system behaves.
- CHAOS MESH- It is an open-source CNCF project (accepted as a Sandbox project) and a powerful chaos engineering platform for Kubernetes. It can inject faults such as pod kills, network latency, and system I/O disruption. All the experiments in Chaos Mesh are written as YAML files.
- GREMLIN- Gremlin provides a platform to run chaos experiments in a safe, simple, and secure way. It is offered as software as a service (SaaS) rather than as open source, and it tests system resiliency with different attack modes. Gremlin can also be automated in CI/CD pipelines and integrates with Kubernetes.
- LITMUS- Litmus is an open-source platform designed for cloud-native infrastructures. It identifies system deficiencies by performing controlled chaos tests.
I am using the CHAOS MESH tool for the demo
Install Chaos Mesh using Helm (other installation options are also available)
Install Helm First
curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3
chmod 700 get_helm.sh
./get_helm.sh
Chaos Mesh Installation
Add chaos mesh repository
helm repo add chaos-mesh https://charts.chaos-mesh.org
View the available Chaos Mesh versions
helm search repo chaos-mesh
Now create the namespace
kubectl create ns chaos-testing
Now install Chaos Mesh in a Docker environment (you can also install it in other environments)
helm install chaos-mesh chaos-mesh/chaos-mesh -n=chaos-testing --version 2.1.5
Verify the installation
kubectl get po -n chaos-testing
Check that the services are also running in the same namespace
kubectl get svc -n chaos-testing
Now open the Chaos Mesh dashboard
Use your Minikube IP and the dashboard’s NodePort (30972 in this demo) to access the dashboard
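If you do not know the IP or the port on your machine, a helper along these lines could look them up. This is a sketch under stated assumptions: `dashboard_url` is a hypothetical function name, the service is assumed to be called `chaos-dashboard` (the chart's default), and the dashboard is assumed to listen on port 2333. The port number on your cluster may differ from the one shown in this demo.

```shell
# Hypothetical helper: print the dashboard URL for a Minikube cluster.
# Assumes the service name chaos-dashboard and dashboard port 2333.
dashboard_url() {
  local ip port
  ip="$(minikube ip)"
  # Read the NodePort that is mapped to the dashboard port.
  port="$(kubectl get svc chaos-dashboard -n chaos-testing \
    -o jsonpath='{.spec.ports[?(@.port==2333)].nodePort}')"
  echo "http://${ip}:${port}"
}
```

Run `dashboard_url` and open the printed address in a browser. If the service is not exposed as a NodePort, `kubectl port-forward -n chaos-testing svc/chaos-dashboard 2333:2333` is an alternative, after which the dashboard should be reachable on localhost port 2333.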
Now, let’s do some chaos experiments
First, you have to generate a token, so click on generate.
After that, use the command below to get the token
kubectl describe secrets account-cluster-manager-kepic
Now, provide a name for your token, paste the token, and submit
Let’s start creating the experiments
Create a namespace
kubectl create ns test-chaos
Create a deployment in the namespace
kubectl create deployment chaos-engineering --image=redis --namespace test-chaos
Scale the deployment up to 8 pods
kubectl scale deployment/chaos-engineering --replicas=8 --namespace test-chaos
Confirm that everything is up and running
kubectl get pods -n test-chaos
Now, define a chaos experiment in a YAML file (chaos-eng.yaml)
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: chaos-mesh
spec:
  mode: one
  selector:
    labelSelectors:
      "app.kubernetes.io/component": "tikv"
  stressors:
    cpu:
      workers: 1
      load: 100
      options: ["--cpu 2", "--timeout 600", "--hdd 1"]
  duration: "30s"
YAML Description- The experiment picks one matching pod (mode: one) and stresses its CPU with 1 worker at 100% load for 30 seconds (duration: "30s")
Apply the YAML file
kubectl apply -f chaos-eng.yaml
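Note that the manifest above selects pods labeled `app.kubernetes.io/component: tikv` and names no namespace, so it may not match the Redis pods created earlier. Those pods should carry the label `app=chaos-engineering`, which `kubectl create deployment` sets from the deployment name, and they live in `test-chaos`. A variant that targets them might look like the sketch below. The file name `chaos-eng-demo.yaml` and the experiment name `chaos-eng-demo` are illustrative choices, not taken from the original demo.

```shell
# Write a variant manifest that targets the demo pods.
# File name and experiment name are illustrative assumptions.
cat > chaos-eng-demo.yaml <<'EOF'
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: chaos-eng-demo
  namespace: test-chaos
spec:
  mode: one
  selector:
    namespaces:
      - test-chaos
    labelSelectors:
      app: chaos-engineering
  stressors:
    cpu:
      workers: 1
      load: 100
  duration: "30s"
EOF
```

You would then apply it the same way, with `kubectl apply -f chaos-eng-demo.yaml`.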
Now, go to the Chaos Mesh Dashboard and check the Experiments
Now check the events
So, this is how you can create your experiments and check whether everything is working fine or not.
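The same checks can also be done from the command line, which doubles as a stop button. The wrappers below are a sketch: the function names are hypothetical, and they assume the experiment is a StressChaos resource, as in the YAML above, living in the `default` namespace unless you pass another one.

```shell
# Hypothetical wrapper: show the status and events of a StressChaos experiment.
experiment_events() {
  local name="$1"
  local ns="${2:-default}"
  kubectl describe stresschaos "$name" -n "$ns"
}

# Hypothetical stop button: deleting the resource ends the experiment.
stop_experiment() {
  local name="$1"
  local ns="${2:-default}"
  kubectl delete stresschaos "$name" -n "$ns"
}
```

For the experiment defined above, `experiment_events chaos-mesh` would print its details, and `stop_experiment chaos-mesh` would remove it.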
You can also create experiments through the dashboard itself
Go to EXPERIMENTS and create a new experiment, then select the EXPERIMENT TYPE
Then fill in the experiment information and submit
As you can see, the experiment has finished. After that, check its events.