Hi readers. Most businesses are moving online, and customer traffic on their applications keeps growing day by day. This has increased profits, but it has also brought problems such as outages, engineering incidents, service glitches and application downtime, which in turn lead to missed deadlines and project delays. To resolve such issues as quickly as possible, teams need to describe incidents in numbers and track those numbers as metrics.
Common Incident metrics
- MTBF stands for Mean Time Between Failures
- MTTR stands for Mean Time To Recovery/Response/Repair/Resolve
- MTTF stands for Mean Time To Failure
- MTTA stands for Mean Time To Acknowledge
Let’s understand each metric.
Mean Time Between Failures (MTBF)
- Average time between repairable failures of a technology product. It tracks availability and reliability.
- MTBF is directly proportional to reliability: the higher the MTBF, the more reliable the system.
Assessment Time: 24 hrs Downtime: 4 hrs No. of Incidents: 2
Uptime: 24 - 4 = 20 hrs MTBF = Uptime/No. of Incidents = 20/2 = 10 hrs
- MTBF tracks reliability.
- Downtime during scheduled maintenance is not included.
- It focuses only on unexpected outages of repairable systems.
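The MTBF calculation above can be sketched in a few lines of Python. This is only an illustration: the function name `mtbf_hours` and its parameter names are assumptions made for this example, and the inputs are the numbers from the worked example.

```python
def mtbf_hours(assessment_hours: float, downtime_hours: float, incidents: int) -> float:
    """Mean time between failures: uptime divided by the number of incidents."""
    if incidents <= 0:
        raise ValueError("MTBF is undefined without at least one incident")
    # Uptime is the assessment window minus unplanned downtime.
    uptime_hours = assessment_hours - downtime_hours
    return uptime_hours / incidents


print(mtbf_hours(assessment_hours=24, downtime_hours=4, incidents=2))  # 10.0
```

Scheduled maintenance should be left out of `downtime_hours`, in line with the points above.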
Mean Time To Recovery/Response/Repair/Resolve (MTTR)
Mean Time To Repair
- Average time to repair and test a system.
Assessment Time: 7 days No. of Incidents: 5 Downtime: 2 hrs
MTTR = Downtime(in mins)/No. of Incidents = 2 * 60/5 = 24 mins
- It is useful in keeping repairs on track.
- MTTR is inversely proportional to reliability.
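In practice we usually have one repair duration per incident rather than a single downtime total. The sketch below assumes such a list; the individual durations are made up for illustration and were chosen so that they add up to the 2 hrs of downtime in the example.

```python
from typing import Sequence

# Hypothetical per-incident repair times in minutes (sum = 120 mins = 2 hrs).
repair_minutes = [30, 20, 25, 15, 30]


def mean_time_to_repair(durations_minutes: Sequence[float]) -> float:
    """Average repair time in minutes across all incidents."""
    if not durations_minutes:
        raise ValueError("at least one incident is required")
    return sum(durations_minutes) / len(durations_minutes)


print(mean_time_to_repair(repair_minutes))  # 24.0
```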
Mean Time To Recovery
- Average time to recover from a system failure.
- This is calculated from the time of the outage until the system becomes fully operational again.
Downtime: 20 mins No. of Incidents: 2 Assessment Time: 24 hrs
MTTR = Downtime/No. of Incidents = 20/2 = 10 mins.
- MTTR (recovery) is useful when we want to assess the speed of the overall recovery process.
- On its own, though, it does not show which stage is slow: the delay could come from late alerts, slow diagnosis or inefficient repair.
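Since recovery time runs from the start of the outage until the system is fully operational, it can be computed from a pair of timestamps per incident. Here is a minimal sketch; the timestamps are hypothetical and were picked so that the two outages total the 20 mins used above.

```python
from datetime import datetime

# Hypothetical outage windows: (service went down, service fully operational again).
outages = [
    (datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 10, 12)),  # 12 mins
    (datetime(2024, 1, 1, 15, 0), datetime(2024, 1, 1, 15, 8)),   # 8 mins
]


def mean_time_to_recovery_minutes(windows):
    """Average outage length in minutes, measured from failure to full recovery."""
    total_seconds = sum((restored - down).total_seconds() for down, restored in windows)
    return total_seconds / 60 / len(windows)


print(mean_time_to_recovery_minutes(outages))  # 10.0
```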
Mean Time To Resolve
- Average time to fully resolve an issue or failure.
- It covers the time to detect the failure, diagnose the problem, repair it and ensure the failure won’t happen again.
Assessment Time: 24 hrs Downtime: 2 hrs No. of Incidents: 1 Time to ensure error won't happen again: 2 hrs
MTTR = (Downtime + Time to ensure error won't happen again)/No. of Incidents = (2 + 2)/1 = 4 hrs
- This metric is for unplanned incidents.
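To make the phases of resolution visible, the sketch below records each phase separately. The `ResolvedIncident` class and the split of the 2 hrs of downtime into detect, diagnose and repair are assumptions for illustration; only the totals (2 hrs of downtime plus 2 hrs of prevention work) come from the example.

```python
from dataclasses import dataclass


@dataclass
class ResolvedIncident:
    detect_hours: float
    diagnose_hours: float
    repair_hours: float
    prevent_hours: float  # follow-up work so the failure does not happen again

    def total_hours(self) -> float:
        return (self.detect_hours + self.diagnose_hours
                + self.repair_hours + self.prevent_hours)


def mean_time_to_resolve(incidents):
    """Average end-to-end resolution time in hours."""
    return sum(i.total_hours() for i in incidents) / len(incidents)


# Hypothetical split: 0.25 + 0.75 + 1.0 = 2 hrs of downtime, plus 2 hrs of prevention.
incidents = [ResolvedIncident(0.25, 0.75, 1.0, 2.0)]
print(mean_time_to_resolve(incidents))  # 4.0
```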
Mean Time To Respond
- Average time to recover from a system failure, measured from the time of the first alert.
- This doesn’t include alert lag time.
Assessment Time: 40 hrs No. of Incidents: 4 Total time to fix: 1 hr
MTTR = Total time to fix (in mins)/No. of Incidents = 1 * 60/4 = 15 mins
- This is especially useful in cybersecurity, where it shows how quickly attacks are neutralized.
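The difference from recovery time is the starting point: the clock starts at the alert, not at the failure. The sketch below shows this with three timestamps per incident. All timestamps are hypothetical; they were chosen so that the four alert-to-recovery times add up to the 1 hr used above.

```python
from datetime import datetime, timedelta

# Hypothetical incidents: (failed_at, alerted_at, recovered_at).
incidents = [
    (datetime(2024, 1, 1, 1, 0), datetime(2024, 1, 1, 1, 5), datetime(2024, 1, 1, 1, 15)),
    (datetime(2024, 1, 1, 9, 0), datetime(2024, 1, 1, 9, 3), datetime(2024, 1, 1, 9, 23)),
    (datetime(2024, 1, 1, 14, 0), datetime(2024, 1, 1, 14, 2), datetime(2024, 1, 1, 14, 17)),
    (datetime(2024, 1, 1, 20, 0), datetime(2024, 1, 1, 20, 4), datetime(2024, 1, 1, 20, 19)),
]


def mean_time_to_respond_minutes(records):
    """Average alert-to-recovery time in minutes; the alert lag is ignored."""
    durations = [recovered - alerted for _failed, alerted, recovered in records]
    return sum(durations, timedelta()).total_seconds() / 60 / len(durations)


print(mean_time_to_respond_minutes(incidents))  # 15.0
```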
Mean Time To Acknowledge (MTTA)
- Average time from when an alert is triggered to when work on the issue actually starts.
No. of Incidents: 10 Total time b/w alert and acknowledgement: 40 mins
MTTA = Total time b/w alert and acknowledgement/No. of Incidents = 40/10 = 4 mins
- MTTA is useful in tracking the responsiveness of the team and the effectiveness of the alert system.
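If the acknowledgement delays are stored as `timedelta` values, Python can average them directly. In this sketch the ten individual delays are made up for illustration; they add up to the 40 mins from the example.

```python
from datetime import timedelta

# Hypothetical alert-to-acknowledgement delays in minutes for 10 incidents (total: 40).
ack_delays = [timedelta(minutes=m) for m in (2, 5, 3, 4, 6, 1, 7, 4, 3, 5)]


def mean_time_to_acknowledge(delays):
    """Average delay between an alert firing and someone acknowledging it."""
    return sum(delays, timedelta()) / len(delays)


mtta = mean_time_to_acknowledge(ack_delays)
print(mtta)  # 0:04:00
```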
Mean Time To Failure (MTTF)
- Average time between non-repairable failures of a product.
- It determines how long a system will last.
- It helps tell customers when to schedule checkups so that they can avoid downtime.
No. of products: 4
Time till Product 1 lasts - t1: 10 hrs.
Time till Product 2 lasts - t2: 9 hrs.
Time till Product 3 lasts - t3: 8 hrs.
Time till Product 4 lasts - t4: 12 hrs.
MTTF = (t1+t2+t3+t4)/No. of products = (10+9+8+12)/4 = 9.75 hrs.
Lifespans of 10, 9, 8 and 12 hrs may look oddly short for a product. That’s why MTTF is mainly useful for products with a short life span.
- For technical products, which typically last much longer, MTTF is useful for planning scheduled maintenance, since we neither wait for nor want our application to fail.
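MTTF is a plain average of the observed lifetimes, so the standard library is enough. A small sketch using the four lifetimes from the example (the variable names are only illustrative):

```python
from statistics import mean

# Lifetimes in hours of the four products from the example above.
lifetimes_hours = [10, 9, 8, 12]

mttf_hours = mean(lifetimes_hours)
print(mttf_hours)  # 9.75
```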
Pictorial representation of metrics and incident flow
That’s all for this blog. We covered the common incident metrics, their use cases and how to calculate them.
Thank you for reading till the end. If you found this blog helpful, do share it with your colleagues. For any feedback, suggestions or questions, reach out to me at email@example.com.