Availability is defined by uptime, i.e. the time between failures. It depends on:
- Downtime and
- Recovery Time
Downtime is the amount of time that the system is unavailable due to either failures or scheduled maintenance.
Recovery time is the average time it takes to recover from failures. This includes time for detection, isolation and resolution.
Hence to have high availability, you must have low downtime and low recovery time.
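To make the relationship concrete, here is a minimal sketch that treats availability as the fraction of a period left over once downtime and recovery time are subtracted. The function name `availability` and the sample figures are assumptions made for illustration only; they do not come from any particular system.

```python
def availability(total_hours: float, downtime_hours: float,
                 recovery_hours: float = 0.0) -> float:
    """Fraction of the period during which the system was usable."""
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    unavailable = downtime_hours + recovery_hours
    return (total_hours - unavailable) / total_hours


# Hypothetical year: 6 hours of downtime plus 2.76 hours spent recovering.
print(f"{availability(365 * 24, 6.0, 2.76):.3%}")
```

Lowering either input raises the result, which is why both downtime and recovery time matter.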
High availability can be introduced by building redundancy into the system, so that the failure of a server is transparent to the user.
So when you have a cluster of servers, what does it imply? It implies that you now have better availability. Clustering is primarily for availability and failover rather than scalability, though you can build a scale-out mechanism by adding more servers.
So what do the terms five nines or four nines mean to you? Well, if we consider a 24×7 environment like that of eBay, Amazon, etc., then four nines would mean that in a year the total downtime + recovery time is roughly 52 and a half minutes. And three nines would mean downtime + recovery time of about 8 hours and 45 minutes.
4 nines -> (365×24) - 0.9999 × (365×24) = 8760 - 8759.124 = 0.876 hours ≈ 52 minutes and 34 secs
3 nines -> (365×24) - 0.999 × (365×24) = 8760 - 8751.24 = 8.76 hours ≈ 8 hours and 45 minutes
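The same arithmetic can be sketched in a few lines of Python. The helper names `allowed_downtime_hours` and `as_hms` are hypothetical, and the sketch assumes a non-leap year of 8760 hours unless a different number of service hours is passed in.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours in a non-leap year, 24x7


def allowed_downtime_hours(nines: int,
                           service_hours: float = HOURS_PER_YEAR) -> float:
    """Hours of downtime + recovery time permitted at the given number of nines."""
    unavailability = 10 ** -nines  # e.g. 4 nines -> 0.0001
    return service_hours * unavailability


def as_hms(hours: float) -> str:
    """Format a duration in hours as hours, minutes and seconds (nearest second)."""
    total_seconds = round(hours * 3600)
    h, rem = divmod(total_seconds, 3600)
    m, s = divmod(rem, 60)
    return f"{h}h {m}m {s}s"


for n in (3, 4, 5):
    print(f"{n} nines: {as_hms(allowed_downtime_hours(n))}")
```

Passing a smaller `service_hours` value gives the budget for a service that is not contracted to run around the clock.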
Now, if someone tells you that their site is 99.99% available and they had a scheduled maintenance of 2 hours last week, what does it mean? Are they lying? 🙂
Not necessarily. Many people use four nines or five nines to represent their availability with respect to the SLA that they have agreed upon.
The SLA might be that they are supposed to be available 20×6 instead of 24×7. So in a typical year their four nines availability as per the SLA would be
Four nines as per the SLA (20×6) = ((365-52)×20) - 0.9999 × ((365-52)×20) = 6260 - 6259.374 = 0.626 hours ≈ 37 minutes 34 secs of allowed downtime within the SLA window.
But given that there are 8760 hours in a year, the total time the service may be down without breaching the SLA is
8760 – 6259.374 = 2500.626 hours = 104 days and 4 hours (approx)
So although the service must be available at four nines within the SLA window (20 hours a day, 6 days a week), there is still an allowed downtime of about 104 days over the year.
And the required uptime under the 20×6 SLA, as a percentage of the required uptime at four nines 24×7, is 6259.374 / 8759.124 × 100 ≈ 71.5%
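The SLA-window arithmetic above can be reproduced with a short sketch. The function `sla_summary` and its parameter names are hypothetical. It assumes the 20×6 window means 20 hours a day with one day off per week (52 days a year), as in the calculation above, and it compares against four nines at 24×7.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, 24x7


def sla_summary(target: float, hours_per_day: float, days_off_per_year: int) -> dict:
    """Uptime and downtime figures for an SLA that covers only part of the year."""
    sla_hours = (365 - days_off_per_year) * hours_per_day
    required_uptime = target * sla_hours
    return {
        "sla_hours": sla_hours,
        "required_uptime_hours": required_uptime,
        # Downtime tolerated inside the agreed service window.
        "in_window_downtime_minutes": (sla_hours - required_uptime) * 60,
        # Everything outside the required uptime, measured against the full year.
        "total_possible_downtime_days": (HOURS_PER_YEAR - required_uptime) / 24,
        # Required uptime relative to the same target applied 24x7.
        "percent_of_24x7": required_uptime / (target * HOURS_PER_YEAR) * 100,
    }


for key, value in sla_summary(0.9999, 20, 52).items():
    print(f"{key}: {value:.3f}")
```

Changing the window (for example 24 hours a day with no days off) shows how quickly the real-world downtime budget shrinks as the SLA approaches 24×7.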
The bottom line: do not take 99.99% at face value without knowing the agreed-upon SLA.
6 thoughts on “What Does 99.99% Availability Mean?”
I would submit that when availability is important, building a cluster is not the most efficient or cost-effective route. Clusters are complex, they fail over, and as you point out, four nines uptime is about all you can hope for. A fault-tolerant server provides more than five-nines availability, runs out of the box, supports Windows and Linux applications (with only one license required instead of two or more) as well as VMware unmodified, and has no failover, downtime or data loss from hardware issues. If an application or network failure takes a fault-tolerant server down, you can restart applications immediately while still being able to root-cause the problem so that it doesn’t happen again. Clusters are popular because they are the best solution the giant server makers can offer, but they are not the best solution for availability.
Hi Ken, I am not sure I really got the entire point. Even with virtualization, you would have to host the VMware images on server hardware, which again would be prone to hardware failures. Hence, if I have 4 virtual servers running on a single piece of hardware, am I not exposed to a SPOF w.r.t. hardware? I agree that an application failure would be more easily solved in this case.
For any HA solution, I reckon, you would need a combination of physical and virtual server solutions. It would be difficult to look just one way and ignore the other.
Hi Vikas. Looking at the environment in its entirety is definitely the best way to go. Regarding the use of fault-tolerant servers as part of a virtual infrastructure, these servers have no single point of failure. They have complete redundancy with all components in active operation all the time. That’s one of the biggest benefits of a fault-tolerant architecture, which is designed to prevent failure from occurring rather than to recover quickly after failure as HA clusters do. From a simple transient error to a CPU failure, an FT server and its VMs continue to run unaffected with no performance degradation, downtime or data loss. Not every application will need this degree of uptime protection. But the more VMs you load on to a physical server, the greater the potential impact to the organization. (By way of full disclosure, my employer is Stratus Technologies.)
Hi Ken, as we know, 3 things are particularly important for system success: no loss of data, continuous availability of the system, and good system performance. I guess with active operation the first 2 are handled fine, but I am not sure about the performance part. Do you have some whitepapers for the architecture that you recommend? I agree with the VM strategy, and in the past we have used VMs not only as a part of HA but also for re-purposing the machines in the production environment depending on the load. Good to know about Stratus. I will check out the offerings.
Hi Vikas, There are many white papers on our website at stratus.com dealing with architecture and virtualization deployments. Here are a couple you may find informative:
Thank you for your interest.
Perfect, Ken. I will post any questions to you once I go through them. Thanks for sharing the info.