How to cope with the threat of system failures?


Systems are designed to respond to their users at all times. But sometimes, for one reason or another, a system fails. System failures can cause huge losses to users and also make them wary; with that fear, users lose trust in the system, which hurts the system’s goodwill and market value. For example, a few months back, Amazon S3 was unavailable for some hours, which caused an outage for many popular websites including Quora (a popular question-and-answer portal) and affected a lot of its users. Quora was not at fault here, because it was totally dependent on S3, but its users hardly care about that. All they saw was that Quora was down for some hours. This is just one recent example of a system failure; there are plenty more.


System failures can happen for the following reasons:

  • The machine on which your application is running may fail.
  • A network failure may disconnect your application from the network completely.
  • A person making manual edits may accidentally make a wrong edit or update.
  • A data center may take its entire system down for maintenance.
  • And many more reasons…

So the question is: how do you handle these sorts of situations?

Well, you cannot prevent the situations mentioned above, but you can handle them in a way that avoids downtime or an outage for your system. The way to do that is to replicate the system/application.

What is Replication?

Replication means keeping copies of your actual data so that in the case of a mishap or system failure, your actual data remains safe (because you have a backup).

Traditionally, we made several photocopies of any important document to guard against mishaps. In the digital world, replication plays the role of the photocopy.


With replication comes its biggest challenge: maintaining consistency. It is hard to coordinate multiple systems. A mismatch between your actual system and the replicated/backup system means data inconsistency, and an inconsistent system can serve stale or conflicting data to its users.
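One simple way to notice such a mismatch is to compare a digest of the data on each side. The following is a minimal sketch, assuming the data fits in memory as a dictionary; the names `checksum` and `is_consistent` and the sample records are hypothetical, and real replication systems use more elaborate mechanisms.

```python
import hashlib
import json


def checksum(records: dict) -> str:
    """Return a SHA-256 digest that does not depend on key order."""
    payload = json.dumps(records, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def is_consistent(primary: dict, replica: dict) -> bool:
    """True when the primary and the replica hold the same records."""
    return checksum(primary) == checksum(replica)


# Hypothetical sample data: one replica is in sync, the other missed a write.
primary = {"user:1": "alice", "user:2": "bob"}
replica_in_sync = {"user:2": "bob", "user:1": "alice"}
replica_stale = {"user:1": "alice"}

print(is_consistent(primary, replica_in_sync))
print(is_consistent(primary, replica_stale))
```

Comparing digests only tells you that the copies differ, not which records differ or which copy is right, so it is a detection step rather than a fix.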

While replicating the system, make sure that you replicate it across multiple locations. You should not replicate in the same location. For example, suppose your system runs in the US zone and is also replicated in the same zone. Suddenly, an outage hits the US zone and your system goes down completely. You decide to bring the replica to the front and start using it, but wait a minute: that replica is in the same zone, which means it is down too.
So, always make sure that your replicated system is not in the same location as your actual system.
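The placement rule can be checked mechanically before a deployment goes live. Here is a small sketch, assuming each node is described by a name and a zone; the `Node` type, the function names, and the zone labels are all hypothetical and not tied to any cloud provider.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Node:
    name: str
    zone: str


def replicas_in_primary_zone(primary: Node, replicas: List[Node]) -> List[Node]:
    """Return the replicas that share a zone with the primary."""
    return [r for r in replicas if r.zone == primary.zone]


def survives_zone_outage(primary: Node, replicas: List[Node], failed_zone: str) -> bool:
    """True if at least one copy lives outside the failed zone."""
    return any(node.zone != failed_zone for node in [primary, *replicas])


primary = Node("app-primary", "us")
same_zone_setup = [Node("app-replica", "us")]
cross_zone_setup = [Node("app-replica", "europe")]

print(survives_zone_outage(primary, same_zone_setup, "us"))
print(survives_zone_outage(primary, cross_zone_setup, "us"))
```

With the replica in the same zone, a US outage leaves no copy standing; with the replica in Europe, one copy survives.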

The biggest advantage of replication is that it reduces the impact of system failures, but there are others:

  • Improved reliability – Since data is replicated to more than one geographical location, it remains available even in the case of a hardware failure.
  • Improved availability – When one data center is unavailable at a particular moment, you can serve requests from another copy of the data and keep the system available.
  • Improved performance by distributing load – Under excessive load, you can spread requests across your replicas, which is called load distribution.
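The availability and load-distribution points can be illustrated together. The sketch below assumes a fixed list of replicas and a health flag that something else (for example a monitoring job) keeps up to date; `ReplicaPool` and its methods are hypothetical names. It hands out replicas in round-robin order and skips any replica marked as down.

```python
from itertools import cycle


class ReplicaPool:
    """Round-robin selection over replicas, skipping unhealthy ones."""

    def __init__(self, replicas):
        self._replicas = list(replicas)
        self._healthy = set(self._replicas)
        self._order = cycle(self._replicas)

    def mark_down(self, replica):
        self._healthy.discard(replica)

    def mark_up(self, replica):
        if replica in self._replicas:
            self._healthy.add(replica)

    def pick(self):
        # Try each replica at most once per call, in rotation order.
        for _ in range(len(self._replicas)):
            candidate = next(self._order)
            if candidate in self._healthy:
                return candidate
        raise RuntimeError("no healthy replica available")


pool = ReplicaPool(["us", "europe", "asia"])
print([pool.pick() for _ in range(3)])  # load is spread over all three

pool.mark_down("europe")                # simulate a data center outage
print([pool.pick() for _ in range(3)])  # requests now skip the failed replica
```

Round-robin is only one of several possible policies; it is used here because it is the simplest way to show both load spreading and failover in a few lines.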

With replication/distribution, you can also take advantage of reduced response time. Let’s say your system runs in the US zone and is replicated somewhere in the Europe zone. If a client request comes from the US, it can be handled by the US zone’s system, and if a client sends a request from England, it can be served from the Europe zone. It need not travel to the US zone, which reduces network latency and therefore response time.
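The routing decision in this example can be sketched as a lookup from the client's location to the nearest zone. The tables and the `route` function below are hypothetical and only mirror the US/England example; real deployments usually rely on DNS-based or load-balancer-based geographic routing rather than a hand-written table.

```python
# Hypothetical mapping from zone name to the replica that serves it.
ZONE_ENDPOINTS = {
    "us": "us-zone",
    "europe": "europe-zone",
}

# Hypothetical mapping from client location to its nearest zone.
NEAREST_ZONE = {
    "US": "us",
    "England": "europe",
}


def route(client_location: str, default_zone: str = "us") -> str:
    """Return the replica that should serve a client from this location."""
    zone = NEAREST_ZONE.get(client_location, default_zone)
    return ZONE_ENDPOINTS[zone]


print(route("US"))
print(route("England"))
print(route("Unknown"))  # unmapped locations fall back to the default zone
```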

By now you should have a good idea of replication and its advantages, but when should you replicate your system? If your system is not mission-critical and its users can tolerate some downtime, you need not replicate it. But if your application must run 24/7 or is mission-critical, you should replicate it using an appropriate replication strategy.

Replication also adds cost to your system. If your budget does not allow for that extra cost, replication may not be an option for you.
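To make "some period of downtime" concrete, it can help to turn an availability target into hours of downtime per year, and to estimate what extra copies buy you. The sketch below is back-of-the-envelope arithmetic, not a measurement: the function names are hypothetical, and the combined figure assumes that replicas fail independently, which is only approximately true when they sit in different locations.

```python
HOURS_PER_YEAR = 365 * 24


def downtime_hours_per_year(availability: float) -> float:
    """Expected yearly downtime for an availability between 0 and 1."""
    return (1 - availability) * HOURS_PER_YEAR


def combined_availability(single: float, copies: int) -> float:
    """Availability of `copies` independent copies where any one can serve.

    Assumes failures are independent; correlated failures make this optimistic.
    """
    return 1 - (1 - single) ** copies


one_copy = 0.99  # illustrative input, not a measured value
two_copies = combined_availability(one_copy, 2)

print(round(downtime_hours_per_year(one_copy), 2))
print(round(downtime_hours_per_year(two_copies), 2))
```

If the first number is acceptable to your users, replication may not be worth its cost; if it is not, the second number shows roughly what an extra independent copy could buy.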



Written by 

Harshit Daga is a Sr. Software Consultant having experience of more than 4 years. He is passionate about Scala development and has worked on the complete range of Scala Ecosystem. He is a quick learner & curious to learn new technologies. He is responsible and a good team player. He has a good understanding of building a reactive application and has worked on various Lightbend technologies like Scala, Akka, Play framework, Lagom, etc.
