Circuit Breaker in Microservices: Istio


Hey guys,
Today we’ll find out what circuit breakers are, why you should be using them in any distributed system or microservices architecture, and how Istio helps you get them.

Why Circuit Breaker?

We’ve got a series of microservices here (in Kubernetes these will be pods, services, etc.), though they don’t strictly have to be microservices. They’re connected together in some way, and a cascading failure is where a problem with just one part of the system can bring down the whole system at once.

Cascading Failures

We all know we can never assume that a network is reliable; there will always be failures, and wherever possible we need our design to be fault-tolerant. We’re talking specifically about computer networks here, but the idea applies to any system where the failure of one or a few parts can trigger the failure of other parts, and so on. This is called a cascading failure.

A series of microservices having problems while connecting to a struggling microservice.

Imagine that you’re working on a series of microservices, and consider the picture above, where we’ve got very busy levels of traffic going from microservice C to microservice D. We’ve suddenly hit a situation where microservice D, for a reason we don’t even know, has started performing slowly. Now we’ve got a huge stack of incoming requests, which I’m trying to denote here with red lines; these are all hanging requests, taking 30 seconds to service. And that’s just the ones that are successful; many of them are simply going to time out.

Now the problem here is that this is likely to cause microservice C to start failing as well, because there is a finite number of open connections that we can have from one pod to another. I hope we get the idea now: once service C starts to struggle, that causes a knock-on problem, and then service B starts to struggle, and so on all the way back through the chain.

Reasons why cascading failures are absolutely horrible:
  • We don’t know when they are going to happen.
  • When one happens, either everything has gone down or a large proportion of the architecture has gone down.
  • They are often very, very difficult to trace, because symptoms of the problem appear across the entire system.
  • Being pragmatic, oftentimes after a cascading failure you just don’t have the time or resources to find the root cause.

Solution

The circuit breaker pattern allows you to build a fault-tolerant and resilient system that can survive gracefully when key services are either unavailable or suffering from high latency.

Hystrix

Many of us might be familiar with a library called Hystrix. It came out of Netflix, who built their own circuit-breaking library and released it as open-source software. It’s a library that we put into our application; say it’s implemented in microservice A. Hystrix will then stop communication with microservice B if A starts having difficulty connecting to B due to some network issue.


However, in 2018 Netflix announced that it had abandoned active development of Hystrix. There are also bigger problems with this approach: you need to build the library into every single microservice in your system, which is tedious in itself. And you can imagine that if you forget to build it into one particular microservice, you’re exposing yourself to the potential of a cascading failure.

ISTIO: The saviour

Istio extends Kubernetes to establish a programmable, application-aware network using the powerful Envoy service proxy. Working with both Kubernetes and traditional workloads, Istio brings standard, universal traffic management, telemetry, and security to complex deployments.
Istio’s resiliency strategy is to detect unusual host behaviour and evict unhealthy hosts from the set of load-balanced healthy hosts inside a cluster.

The proxy container holds the circuit breaker.

Istio helps us here by letting us configure a circuit breaker for our services with the help of a DestinationRule and its sub-components.

apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: circuit-breaker-for-the-whole-default-namespace
spec:
  host: "demo -service.default.svc.cluster.local"    # This is the name of the k8s service that we're configuring

  trafficPolicy:
    outlierDetection: # Circuit Breakers have been SWITCHED ON
      maxEjectionPercent: 100
      consecutive5xxErrors: 2
      interval: 10s
      baseEjectionTime: 30s

The above DestinationRule contains the following fields:

  • consecutive5xxErrors: Number of consecutive 5xx errors before a host is ejected from the connection pool.
  • maxEjectionPercent: Maximum % of hosts in the load-balancing pool for the upstream service that can be ejected. Defaults to 10%.
  • baseEjectionTime: Minimum ejection duration. A host will remain ejected for a period equal to the product of the minimum ejection duration and the number of times the host has been ejected.
  • interval: Time interval between ejection sweep analyses. Format: 1h/1m/1s/1ms. Must be >= 1ms. Defaults to 10s.
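Outlier detection is only one of the DestinationRule sub-components. The trafficPolicy also accepts a connectionPool section, which caps how many connections and pending requests the calling sidecar will open towards the upstream service; that directly addresses the “finite number of open connections” problem we saw in the cascading-failure example. Below is a minimal sketch combining both sub-components; the rule name and the limits are illustrative values picked for this demo, so tune them for your own traffic:

apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: circuit-breaker-with-connection-limits    # illustrative name
spec:
  host: "demo-service.default.svc.cluster.local"  # same k8s service as in the example above

  trafficPolicy:
    connectionPool:                  # limit how much load we will even attempt to send
      tcp:
        maxConnections: 10           # illustrative: max TCP connections to the upstream host
      http:
        http1MaxPendingRequests: 5   # illustrative: max requests queued while waiting for a connection
        maxRequestsPerConnection: 1  # illustrative: force a fresh connection per request
    outlierDetection:                # the circuit breaker from the example above, unchanged
      maxEjectionPercent: 100
      consecutive5xxErrors: 2
      interval: 10s
      baseEjectionTime: 30s

Once saved to a file, this is applied like any other manifest, for example kubectl apply -f destination-rule.yaml, and Istio pushes the configuration to every Envoy sidecar that talks to this host; no application code change is needed.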

That’s pretty much it for this article; you can try the sample example on your own cluster. If you have any feedback or queries, please do let me know in the comments. Also, if you liked the article, please give me a thumbs up and I will keep writing blogs like this for you in the future as well. Keep reading and keep coding.
