Understanding a few common incident metrics

Hi readers. In the current market, most businesses are moving online, and customer traffic to their applications grows day by day. This has increased profits, but it has also brought problems such as outages, engineering incidents, service glitches and application downtime, which in turn lead to missed deadlines and project delays. With incidents this frequent, it became necessary to describe them in numbers and track those metrics so that issues can be resolved as quickly as possible.

Common Incident metrics

MTBF stands for Mean Time Between Failures
MTTR stands for Mean Time To Recovery/Respond/Repair/Resolve
MTTF stands for Mean Time To Failure
MTTA stands for Mean Time To Acknowledge

Let’s understand each metric

Mean Time Between Failures (MTBF)

Average time between repairable failures of a technology product. It tracks availability and reliability.
MTBF is directly proportional to reliability: the higher the MTBF, the more reliable the system.

Calculation:

Assessment Time: 24 hrs
Downtime: 4 hrs
No. of Incidents: 2

Uptime: 24 - 4 = 20 hrs
MTBF = Uptime/No. of Incidents = 20/2 = 10 hrs
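
The calculation above can be sketched in a few lines of Python. This is only an illustration: the function name mtbf_hours and its parameters are made up for this post and do not come from any monitoring tool.

```python
def mtbf_hours(assessment_hours: float, downtime_hours: float, incidents: int) -> float:
    """Mean time between failures: uptime divided by the number of incidents."""
    if incidents <= 0:
        raise ValueError("MTBF is undefined without at least one incident")
    # Uptime is whatever is left of the assessment window after unplanned downtime.
    uptime_hours = assessment_hours - downtime_hours
    return uptime_hours / incidents


# Same numbers as the example: 24 hrs window, 4 hrs downtime, 2 incidents.
print(mtbf_hours(24, 4, 2))  # 10.0
```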

NOTE:

MTBF tracks reliability.
Downtime during scheduled maintenance is not included.
It focuses only on unexpected outages of repairable systems.

Mean Time To Recovery/Respond/Repair/Resolve (MTTR)

Mean Time To Repair

Average time taken to repair and test a system.

Calculation:

Assessment Time: 7 days
No. of Incidents: 5
Downtime: 2 hrs

MTTR = Downtime (in mins)/No. of Incidents = (2 * 60)/5 = 24 mins
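
As a rough sketch, the same arithmetic in Python, including the hours-to-minutes conversion. The function name is hypothetical and chosen only for this example.

```python
def mean_time_to_repair_minutes(downtime_hours: float, incidents: int) -> float:
    """Mean time to repair, reported in minutes."""
    if incidents <= 0:
        raise ValueError("MTTR is undefined without at least one incident")
    downtime_minutes = downtime_hours * 60  # convert hours to minutes first
    return downtime_minutes / incidents


# Same numbers as the example: 2 hrs of downtime across 5 incidents.
print(mean_time_to_repair_minutes(2, 5))  # 24.0
```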

NOTE:

It is useful in keeping repairs on track.
MTTR is inversely proportional to reliability.

Mean Time To Recovery

Average time to recover from a system failure.
This is calculated from the time of the outage until the system becomes fully operational again.

Calculation:

Downtime: 20 mins
No. of Incidents: 2
Assessment Time: 24 hrs

MTTR = Downtime/No. of Incidents = 20/2 = 10 mins
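
In practice the downtime usually comes from timestamps rather than a ready-made total. Below is a minimal sketch, assuming each outage is recorded as a (went down, fully operational again) pair. The dates and the 12 + 8 minute split are invented for illustration; they were picked only so that they add up to the 20 minutes used above.

```python
from datetime import datetime, timedelta

# Hypothetical outage windows: (service went down, service fully operational again).
outages = [
    (datetime(2024, 1, 1, 9, 0), datetime(2024, 1, 1, 9, 12)),     # 12 mins
    (datetime(2024, 1, 1, 18, 30), datetime(2024, 1, 1, 18, 38)),  # 8 mins
]


def mean_time_to_recovery(windows) -> timedelta:
    """Average outage length over all recorded outage windows."""
    if not windows:
        raise ValueError("MTTR is undefined without at least one incident")
    # Sum the individual outage durations, starting from a zero timedelta.
    total = sum((restored - down for down, restored in windows), timedelta())
    return total / len(windows)


mttr = mean_time_to_recovery(outages)
print(mttr.total_seconds() / 60)  # 10.0
```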

NOTE:

MTTR (recovery) is useful when we want to assess the speed of the overall recovery process.
However, it does not show where the time is lost: alerts may arrive late, diagnosis may be slow, or the repair itself may be inefficient.

Mean Time To Resolve

Average time to fully resolve an issue or failure.
It includes the time to detect the failure, diagnose the problem, repair it and ensure the failure won’t happen again.

Calculation:

Assessment Time: 24 hrs
Downtime: 2 hrs
No. of Incidents: 1
Time to ensure error won't happen again: 2 hrs

MTTR = (Downtime + Time to ensure error won't happen again)/No. of Incidents = (2 + 2)/1 = 4 hrs
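
With more than one incident, the total is divided by the incident count, like the other MTTR variants. A small sketch of that, assuming the downtime figure already covers detection, diagnosis and repair. The ResolvedIncident class and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ResolvedIncident:
    downtime_hours: float    # assumed to cover detection, diagnosis and repair
    prevention_hours: float  # follow-up work so the failure does not happen again


def mean_time_to_resolve_hours(incidents) -> float:
    """Average of (downtime + prevention work) per incident."""
    if not incidents:
        raise ValueError("MTTR is undefined without at least one incident")
    total = sum(i.downtime_hours + i.prevention_hours for i in incidents)
    return total / len(incidents)


# Same numbers as the example: one incident, 2 hrs downtime + 2 hrs prevention.
print(mean_time_to_resolve_hours([ResolvedIncident(2, 2)]))  # 4.0
```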

NOTE:

This metric is for unplanned incidents.

Mean Time To Respond

Average time to recover from a system failure, measured from the time of the first alert.
This doesn’t include alert lag time.

Calculation:

Assessment Time: 40 hrs
No. of Incidents: 4
Time to fix: 1 hr

MTTR = Time to fix (in mins)/No. of Incidents = (1 * 60)/4 = 15 mins
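
If the fix time is kept as a timedelta, Python handles the unit conversion for us. A minimal sketch with a hypothetical function name; it assumes the fix time is already measured from the first alert, so alert lag is excluded.

```python
from datetime import timedelta


def mean_time_to_respond(total_fix_time: timedelta, incidents: int) -> timedelta:
    """Average time from first alert to recovery, per incident."""
    if incidents <= 0:
        raise ValueError("MTTR is undefined without at least one incident")
    # Dividing a timedelta by an int gives a timedelta, so no manual * 60 is needed.
    return total_fix_time / incidents


# Same numbers as the example: 1 hr of fix time across 4 incidents.
print(mean_time_to_respond(timedelta(hours=1), 4))  # 0:15:00
```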

NOTE:

This metric is most useful in cybersecurity, where it shows how quickly an attack is neutralized once it has been flagged.

Mean Time To Acknowledge (MTTA)

Average time between the moment an alert is triggered and the moment work actually starts on the issue.

Calculation:

No. of Incidents: 10
Total time b/w alert and acknowledgement: 40 mins

MTTA = Total time b/w alert and acknowledgement/No. of Incidents = 40/10 = 4 mins
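
The 40 minutes above is a total over all ten incidents. A short sketch of how it might be built up from per-incident delays; the individual values below are invented for illustration and were chosen only so that they add up to 40.

```python
# Hypothetical delay (in minutes) between alert and acknowledgement for each
# of the 10 incidents; the values sum to the 40 minutes used in the example.
ack_delays_minutes = [2, 5, 3, 6, 4, 1, 7, 4, 3, 5]


def mtta_minutes(delays) -> float:
    """Mean time to acknowledge: average alert-to-acknowledgement delay."""
    if not delays:
        raise ValueError("MTTA is undefined without at least one incident")
    return sum(delays) / len(delays)


print(mtta_minutes(ack_delays_minutes))  # 4.0
```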

NOTE:

MTTA is useful in tracking the responsiveness of the team and the effectiveness of the alert system.

Mean Time To Failure (MTTF)

Average time between non-repairable failures of a product.
It determines how long a system will last.
It helps tell customers when to schedule checkups so that they can avoid downtime.

No. of products: 4
Time till Product 1 lasts - t1: 10 hrs.
Time till Product 2 lasts - t2: 9 hrs.
Time till Product 3 lasts - t3: 8 hrs.
Time till Product 4 lasts - t4: 12 hrs.

MTTF = (t1 + t2 + t3 + t4)/No. of products = (10 + 9 + 8 + 12)/4 = 9.75 hrs

However, these numbers may look odd, since the products last only 10, 9, 8 and 12 hrs respectively. That’s why MTTF is useful only for products with a short life span.
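
The MTTF calculation is a plain average of product lifetimes, which can be sketched as follows. The function name mttf_hours is hypothetical.

```python
def mttf_hours(lifetimes_hours) -> float:
    """Mean time to failure: average lifetime of non-repairable products."""
    if not lifetimes_hours:
        raise ValueError("MTTF is undefined without at least one product")
    return sum(lifetimes_hours) / len(lifetimes_hours)


# Same numbers as the example: four products lasting 10, 9, 8 and 12 hrs.
print(mttf_hours([10, 9, 8, 12]))  # 9.75
```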

NOTE:

For technical products, which typically last longer, MTTF is useful for scheduling maintenance in advance, since we neither wait for nor want our application to fail.

Pictorial representation of metrics and incident flow

That’s all for this blog. We looked at various incident metrics, their use cases and how to calculate them.

Thank you for following this blog till the end. If you found it helpful, do share it with your colleagues. For any feedback, suggestions or questions, reach out to me at nitin.mishra@knoldus.com.