Today we’ll find out what circuit breakers are, why you should be using them in any distributed system or microservices architecture, and how to configure them with Istio.
Why Circuit Breaker?
We’ve got a series of microservices here ( in Kubernetes these will be pods, services, etc. ), though they don’t strictly have to be microservices. They’re connected together in some way, and a cascading failure is where a problem in just one part of the system brings down the whole system at once.
We all know we can never assume that a network is reliable; there will always be failures, and wherever possible our design needs to be fault-tolerant. We’re talking specifically about computer networks here, but this applies generally to any failure where one or a few parts trigger the failure of other parts, and so on. This is called a Cascading Failure.
Imagine you’re working on a microservice architecture like the one in the picture above, where we’ve got very busy levels of traffic going from microservice C to microservice D. We’ve suddenly hit a situation where microservice D, for a reason we don’t even know, has started performing slowly. Now we’ve got a huge stack of incoming requests, which I’m trying to denote here with red lines: these are all hanging requests, taking 30 seconds to service. And those are the successful ones; many of them are simply going to time out.
The problem is that this is likely to cause microservice C to start failing too, because there is a finite number of open connections we can have from one pod to another. I hope we get the idea now: if service C starts to struggle, then service B starts to struggle, and the failure works its way all the way back through the chain.
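To make the connection-pool exhaustion concrete, here’s a toy Python sketch (this is not Istio or Kubernetes code, and the names are made up): C’s client has a finite number of connection slots to D, slow requests hold their slots open, and once the pool is exhausted even healthy requests get rejected.

```python
import threading

class ConnectionPool:
    """A toy upstream client with a finite number of open connections."""
    def __init__(self, max_connections):
        self.slots = threading.Semaphore(max_connections)

    def call(self, downstream_is_slow):
        # Try to grab a connection slot without blocking; a real client
        # would use a checkout timeout here instead.
        if not self.slots.acquire(blocking=False):
            return "rejected: pool exhausted"
        try:
            if downstream_is_slow:
                # Simulate a hanging request: the slot is never released.
                return "hanging"
            return "ok"
        finally:
            if not downstream_is_slow:
                self.slots.release()

pool = ConnectionPool(max_connections=2)
results = [pool.call(downstream_is_slow=True) for _ in range(2)]  # D is slow
results.append(pool.call(downstream_is_slow=False))  # now C fails too
print(results)
```

Two hanging requests are enough to exhaust the pool, so the third request fails even though the path it wants is perfectly healthy. That is the seed of a cascading failure.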
Why cascading failures are absolutely horrible:
- We don’t know when this is going to happen.
- When they happen, the whole system, or a large proportion of the architecture, goes down at once.
- They are often very, very difficult to trace, because symptoms of the problem appear across the entire system.
- Being pragmatic, after a cascading failure you often just don’t have the time or resources to find the root cause.
The circuit breaker pattern allows you to build a fault-tolerant and resilient system that can survive gracefully when key services are either unavailable or suffering from high latency.
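In code, the pattern boils down to a small state machine. Here is a minimal, illustrative Python sketch (the class and parameter names are my own, not Hystrix’s or Istio’s): after a threshold of consecutive failures the breaker “opens” and fails fast, and after a cooldown it lets one trial request through again.

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: opens after N consecutive failures,
    rejects calls while open, and retries after a cooldown."""
    def __init__(self, failure_threshold=2, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: allow one trial request
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0  # any success closes the circuit again
        return result

breaker = CircuitBreaker(failure_threshold=2, reset_timeout=30.0)

def failing_call():
    raise IOError("service D timed out")

for _ in range(2):
    try:
        breaker.call(failing_call)
    except IOError:
        pass  # real failures coming back from the downstream service

try:
    breaker.call(failing_call)
except RuntimeError as err:
    print(err)  # the breaker now fails fast without touching D
```

The point of failing fast is that callers stop holding connections open to a struggling service, which is exactly what prevents the exhaustion shown earlier.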
Many of us might be familiar with Hystrix, a circuit-breaking library that came out of the Netflix project and was released as open source software. Hystrix is built into the application itself: implemented in microservice A, it will stop communication with microservice B if A has difficulty connecting to B due to some network issue.
However, in 2018 Netflix announced that it was abandoning active development of Hystrix. There are also big problems with the library approach itself: you need to build it into every single microservice in your system, which is tedious in itself. Worse, you might forget to build it into one particular microservice, and then you’re exposing yourself to the potential of a cascading failure.
Istio: The Saviour
Istio extends Kubernetes to establish a programmable, application-aware network using the powerful Envoy service proxy. Working with both Kubernetes and traditional workloads, Istio brings standard, universal traffic management, telemetry, and security to complex deployments.
Istio’s resiliency strategy is to detect unusual host behaviour and evict the unhealthy hosts from the set of load-balanced healthy hosts inside a cluster.
Istio helps us configure a circuit breaker for our services with the help of the DestinationRule resource and its sub-fields.
```yaml
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: circuit-breaker-for-the-whole-default-namespace
spec:
  host: "demo-service.default.svc.cluster.local"  # the k8s service that we're configuring
  trafficPolicy:
    outlierDetection:      # circuit breakers have been SWITCHED ON
      maxEjectionPercent: 100
      consecutive5xxErrors: 2
      interval: 10s
      baseEjectionTime: 30s
```
The DestinationRule above contains the following fields:
- consecutive5xxErrors: Number of consecutive 5xx errors before a host is ejected from the connection pool.
- maxEjectionPercent: Maximum % of hosts in the load balancing pool for the upstream service that can be ejected. Defaults to 10%.
- baseEjectionTime: Minimum ejection duration. A host will remain ejected for a period equal to the product of the minimum ejection duration and the number of times the host has been ejected.
- interval: Time interval between ejection sweep analyses. Format: 1h/1m/1s/1ms. Must be >= 1ms. Default is 10s.
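Outlier detection can also be combined with connection-pool limits in the same DestinationRule, so that Istio both caps how hard a host can be hammered and ejects it when it misbehaves. A sketch (the host name and numbers are illustrative, not recommendations):

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: circuit-breaker-with-connection-limits
spec:
  host: "demo-service.default.svc.cluster.local"
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100          # cap on open TCP connections to the host
      http:
        http1MaxPendingRequests: 10  # queued requests beyond this are rejected
        maxRequestsPerConnection: 1  # force a fresh connection per request
    outlierDetection:
      maxEjectionPercent: 100
      consecutive5xxErrors: 2
      interval: 10s
      baseEjectionTime: 30s
```

The connectionPool limits address exactly the finite-connections problem described earlier: instead of hanging requests piling up in service C, excess requests are rejected immediately by the Envoy sidecar.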
That’s pretty much it for this article; you can try the sample example on your own cluster. If you have any feedback or queries, please let me know in the comments. Also, if you liked the article, please give me a thumbs up and I will keep writing blogs like this for you in the future as well. Keep reading and keep coding.