How To Run Chaos Experiments on Chaos Mesh

In this blog, we will learn about chaos engineering concepts and some of its tools, and then run chaos experiments with Chaos Mesh.

WHAT IS CHAOS ENGINEERING

Chaos engineering is the discipline of experimenting on a system in order to build confidence in the system’s capability to withstand turbulent conditions in production.

In simple words, it is thoughtful, planned experiments designed to reveal the weaknesses in systems.

Basically, you pick something that could go wrong, plan an experiment around it, and run it, so that you can reveal the weaknesses in your system, whether they are in hardware or in software.

WHY DO WE NEED CHAOS ENGINEERING

  • Nowadays, most companies are moving toward microservice architecture because it provides flexibility.
  • These microservice architectures are complex and distributed, with countless connections and dependencies between services.
  • It is practically very difficult to test each and every interaction with traditional QA practices.
  • So, you need a mechanism to test these things proactively.
  • That’s why we need Chaos Engineering: so that we can find issues in large distributed systems.

PRINCIPLES OF CHAOS ENGINEERING (How it works)

1. DEFINE STEADY-STATE

It is nothing but the normal behavior of a system over time. Now, what is the normal behavior of a system?

Example: I am a user of a shopping website and I want to track my order. When that works as expected, the system is behaving normally. Basically, this is the steady state for your application.

To know your steady state, you need a few system metrics and business metrics.

System metrics are those you get from the system itself, such as the operating system or your application, for example latency.

Business metrics are those which matter to the business, for example the number of logins per minute during peak or the number of failed logins per minute.

2. FORM YOUR HYPOTHESIS

To form a hypothesis, ask “what if” questions about your system.

Example: IF a random Virtual Machine is terminated, THEN failures are negligible (fewer than 10 failed logins per 10,000 logins).

Basically, this is your hypothesis, and “negligible” is defined precisely: fewer than 10 failed logins per 10,000 logins.
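A hypothesis like this is easiest to judge when it is written as a check you can run against your metrics. The following is a minimal Python sketch of that idea. The function names and the login counts are hypothetical and only illustrate the threshold from the example above.

```python
def failed_logins_per_10k(failed_logins: int, total_logins: int) -> float:
    """Normalize a failed-login count to a rate per 10,000 logins."""
    if total_logins <= 0:
        raise ValueError("total_logins must be positive")
    return failed_logins * 10_000 / total_logins


def hypothesis_holds(failed_logins: int, total_logins: int,
                     max_failed_per_10k: float = 10.0) -> bool:
    """True when failures stay below the 'negligible' threshold."""
    return failed_logins_per_10k(failed_logins, total_logins) < max_failed_per_10k


if __name__ == "__main__":
    # Hypothetical counts collected while one VM was terminated.
    print(hypothesis_holds(failed_logins=12, total_logins=40_000))  # True (3 per 10,000)
    print(hypothesis_holds(failed_logins=50, total_logins=40_000))  # False (12.5 per 10,000)
```

Writing the threshold down as a number before the experiment starts keeps the result from being reinterpreted afterwards.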

3. PLAN AND RUN YOUR EXPERIMENT

Now that we know the steady state and have built the hypothesis, we have to plan and run the experiment. For planning, gather all the stakeholders, look at the architecture of your application, and then contain the Blast Radius.

What is Blast Radius?

Example: Let’s say we have 50 Virtual Machines in production. You really don’t want to inject chaos or failure into all 50 VMs, because then your customers will suffer.

So, let’s say you decide to inject the failure into only one VM. That’s how you contain the Blast Radius. Basically, you are defining your scope: the chaos experiment will touch 1, 2, or 3 systems, not all of them.

When you plan this kind of exercise, notify your organization that you are going to run a chaos experiment, so that everyone is ready if something goes down.

Have a stop button ready, so that you can abort the test if something goes wrong.
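The two safety ideas above, a limited Blast Radius and a stop button, can be sketched in a few lines of Python. This is only an illustration under assumed names: the host names, pick_targets, and run_steps are hypothetical and not part of any chaos tool.

```python
import random
from typing import Callable, List, Optional, Sequence


def pick_targets(hosts: Sequence[str], max_targets: int,
                 seed: Optional[int] = None) -> List[str]:
    """Limit the blast radius: choose at most max_targets hosts at random."""
    if not 1 <= max_targets <= len(hosts):
        raise ValueError("max_targets must be between 1 and len(hosts)")
    return random.Random(seed).sample(list(hosts), max_targets)


def run_steps(steps: Sequence[Callable[[], None]],
              stop_requested: Callable[[], bool]) -> int:
    """Run steps in order, checking the stop button before each one.

    Returns the number of steps that were actually executed.
    """
    executed = 0
    for step in steps:
        if stop_requested():
            break
        step()
        executed += 1
    return executed


if __name__ == "__main__":
    # Hypothetical fleet of 50 production VMs; only 2 are put in scope.
    fleet = [f"vm-{i:02d}" for i in range(1, 51)]
    targets = pick_targets(fleet, max_targets=2)
    steps = [lambda name=name: print(f"injecting failure into {name}")
             for name in targets]
    print(run_steps(steps, stop_requested=lambda: False))
```

In a real run, stop_requested would read whatever signal your team agrees on, for example an alert firing or an operator pressing abort.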

4. MEASURE AND LEARN

Use your metrics to prove or disprove the hypothesis. You have to ask these kinds of questions:

  • Was the system resilient to the injected failure?
  • Did anything unexpected happen?

Then share your progress and successes with the people who work on the application.
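One way to answer the first question is to compare the metrics recorded during the experiment with the steady-state baseline. The sketch below assumes that a metric may rise by at most 10% over its baseline; that tolerance, the metric names, and the numbers are all hypothetical.

```python
from typing import Dict


def within_tolerance(baseline: float, observed: float,
                     max_increase: float = 0.10) -> bool:
    """True if observed is at most max_increase (a fraction) above baseline."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (observed - baseline) / baseline <= max_increase


def evaluate_experiment(baseline: Dict[str, float],
                        observed: Dict[str, float],
                        max_increase: float = 0.10) -> Dict[str, bool]:
    """Check every baseline metric against the value seen during the experiment."""
    return {name: within_tolerance(value, observed[name], max_increase)
            for name, value in baseline.items()}


if __name__ == "__main__":
    # Hypothetical steady-state values and values seen during the experiment.
    steady_state = {"p95_latency_ms": 200.0, "failed_logins_per_min": 4.0}
    during_chaos = {"p95_latency_ms": 212.0, "failed_logins_per_min": 9.0}
    print(evaluate_experiment(steady_state, during_chaos))
```

A metric that fails the check is not a failed experiment. It is the weakness the experiment was designed to reveal.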

In summary

  • Define a steady state
  • Create a hypothesis
  • Run an experiment (if the experiment is successful, scale up and repeat)
  • If the experiment fails, find and fix the issue, and abort the experiment whenever your abort conditions are met
  • Don’t forget to minimize the Blast Radius

HOW CHAOS ENGINEERING IS DIFFERENT FROM TESTING

Testing is a kind of validation: you know the input and the expected output, and you validate that they match.

Chaos Engineering is all about experimentation: you make a hypothesis, and then you run experiments whose outcome nobody knows in advance.

Example: You have a microservice architecture in which service A talks to service B, service C, and so on. Your hypothesis is your thinking about how a failure in service A affects service B, and you then plan experiments to prove or disprove that theory. That is how chaos engineering is different from testing.

SOME TOOLS WHICH ARE ON THE MARKET (MOSTLY OPEN-SOURCE)

  1. CHAOS MONKEY- Netflix has something called the SIMIAN ARMY, a suite of tools for testing chaos inside the production environment. Chaos Monkey was the first of these tools, created by Netflix to test their applications in production on AWS so that they could build confidence in their systems. Chaos Monkey randomly terminates instances running in production so that you can see how the rest of the system behaves.
  2. CHAOS MESH- It is a CNCF project (it joined as a Sandbox project). It’s a powerful chaos engineering platform for Kubernetes. It can inject faults such as pod kills, network latency and other network faults, and file system input and output faults. All the experiments in Chaos Mesh are written in YAML files.
  3. GREMLIN- Gremlin provides a platform to run chaos experiments in a safe, simple, and secure way. Unlike the other tools in this list, it is a commercial software-as-a-service product. It is used to test system resiliency with different attack modes. Gremlin can also be automated in CI/CD pipelines and integrates with Kubernetes.
  4. LITMUS- Litmus is an open-source chaos engineering platform designed for cloud-native infrastructure. It identifies weaknesses in a system by performing controlled chaos tests.

DEMO

I am using the CHAOS MESH tool for the demo

Install Chaos Mesh using Helm (you can also use other options for installation)

Install Helm First

curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3

chmod 700 get_helm.sh

./get_helm.sh

Chaos Mesh Installation

Add chaos mesh repository

helm repo add chaos-mesh https://charts.chaos-mesh.org

View the chaos mesh version

helm search repo chaos-mesh

Now create Namespace

kubectl create ns chaos-testing

Now install Chaos Mesh with Helm. This command uses the chart defaults, which target the Docker container runtime (you can install it in other environments also)

helm install chaos-mesh chaos-mesh/chaos-mesh -n=chaos-testing --version 2.1.5

Verify the installation

kubectl get po -n chaos-testing

OUTPUT

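Check the services running in the same namespace as well (for example, with kubectl get svc -n chaos-testing)

Now open the Chaos Mesh dashboard

Use your Minikube IP and the NodePort of the chaos-dashboard service to access the dashboard (in my case, port 30972)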

Now, let’s do some chaos experiments

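First, you have to generate the token, so click on generate.

Now, select cluster scoped and the role as manager, then copy the generated RBAC.yaml, save it to a file, and apply that file in the terminal.

After that, use the below command to get the token (the suffix of the account name is generated for you, so use the name from your own RBAC.yaml)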

kubectl describe secrets account-cluster-manager-kepic

Now, enter a name for your token, paste the token you copied, and submit

Let’s start creating the experiments

Create a namespace

kubectl create ns test-chaos

Create deployment in the namespace

kubectl create deployment chaos-engineering --image=redis --namespace test-chaos

Scale the deployment up to 8 pods

kubectl scale deployment/chaos-engineering --replicas=8 --namespace test-chaos

To confirm everything is up and running

kubectl get pods -n test-chaos

OUTPUT
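To check the same thing from a script rather than by reading the output, one option is to ask kubectl for JSON (kubectl get pods -n test-chaos -o json) and inspect each pod's phase. The helper below is a minimal sketch. The function name and the sample pod list are hypothetical, and a pod in the Running phase is not necessarily Ready.

```python
import json
from typing import List


def pods_not_running(pods_json: str) -> List[str]:
    """Return the names of pods whose phase is anything other than Running.

    pods_json is expected to be the output of `kubectl get pods -o json`.
    """
    pod_list = json.loads(pods_json)
    return [
        item["metadata"]["name"]
        for item in pod_list.get("items", [])
        if item.get("status", {}).get("phase") != "Running"
    ]


if __name__ == "__main__":
    # Hypothetical, heavily trimmed sample of the JSON that kubectl prints.
    sample = json.dumps({
        "items": [
            {"metadata": {"name": "pod-a"}, "status": {"phase": "Running"}},
            {"metadata": {"name": "pod-b"}, "status": {"phase": "Pending"}},
        ]
    })
    print(pods_not_running(sample))  # ['pod-b']
```

An empty list means every pod in the namespace has reached the Running phase.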

Now, create a chaos experiment in a YAML file (here named chaos-eng.yaml)

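apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: chaos-mesh
  # Create the experiment in the same namespace as the demo deployment
  namespace: test-chaos
spec:
  # Stress one randomly chosen pod from those matched by the selector
  mode: one
  selector:
    namespaces:
      - test-chaos
    labelSelectors:
      # kubectl create deployment labels its pods app=<deployment name>
      "app": "chaos-engineering"
  stressors:
    cpu:
      # One stress worker, loaded at 100%
      workers: 1
      load: 100
  # Stop the experiment after 30 seconds
  duration: "30s"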

YAML Description- The experiment runs one CPU stress worker at 100% load on one selected pod and stops after 30 seconds
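If you want to run the same experiment with different values, the manifest can also be generated instead of edited by hand. The Python sketch below builds the manifest as a dictionary and prints it as JSON, which kubectl apply -f also accepts. The function name is hypothetical, and the sketch assumes the target pods are the ones from the demo deployment in the test-chaos namespace, which kubectl create deployment labels app=chaos-engineering.

```python
import json


def stress_chaos_manifest(name: str, namespace: str, app_label: str,
                          workers: int = 1, load: int = 100,
                          duration: str = "30s") -> dict:
    """Build a StressChaos manifest that stresses the CPU of one matching pod."""
    return {
        "apiVersion": "chaos-mesh.org/v1alpha1",
        "kind": "StressChaos",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # "one" picks a single pod at random from the selected set.
            "mode": "one",
            "selector": {
                "namespaces": [namespace],
                # Assumption: pods carry the label app=<app_label>.
                "labelSelectors": {"app": app_label},
            },
            "stressors": {"cpu": {"workers": workers, "load": load}},
            "duration": duration,
        },
    }


if __name__ == "__main__":
    manifest = stress_chaos_manifest("chaos-mesh", "test-chaos", "chaos-engineering")
    # Redirect this output to a .json file and apply it with kubectl apply -f.
    print(json.dumps(manifest, indent=2))
```

Keeping the selector and the duration as parameters makes it easy to start with a small scope and widen it only after earlier runs have passed.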

Apply the YAML file

kubectl apply -f chaos-eng.yaml

Now, go to the Chaos Mesh Dashboard and check the Experiments

As you can see the chaos-mesh experiment is Finished

Check events now

So, this is how you can create your experiments and check whether everything is working fine or not.

You can also create experiments through the dashboard itself

Go to EXPERIMENTS and create a new experiment, then select the EXPERIMENT TYPE

I am selecting Kubernetes as the experiment type, then pod fault, and then pod failure as the experiment (you can choose any type of experiment)

Then fill in the experiment information and submit

Check experiments

As you can see, the experiment is finished. After that, check the events

So, that’s how you can use chaos-mesh for your applications.
