SRE: Eliminating toil

Hello everyone. We all know the everyday meaning of “toil”. In this blog, we are going to see what exactly it means in SRE and how to calculate it.

In the daily routine of any organization, there is work we don’t enjoy, e.g. paperwork, attending meetings, sending emails, and more. It is tempting to call all of this “toil” because we do it again and again, but that is not correct: these are tasks we have to do for their long-term value. We can call them “overhead”.

So…

What exactly is “toil”?

Toil is work that is directly tied to running a production service. As the saying goes: if a human operator needs to touch your system during normal operations, you have a bug. The definition of normal changes as your systems grow.

Toil can be recognized by the following traits. The more of them a task matches, the more likely it is to be toil:

  • Manual: A task you perform by hand. Even if a script automates the steps, the hands-on time spent running that script manually still counts as toil.
  • Repetitive: Doing a task for the first or second time is not toil. It becomes toil when you perform the same task again and again, e.g. more than 3-4 times.
  • Automatable: A task that a machine could do just as well as a human. If the task needs human judgment, it is probably not toil.
  • Tactical: Interrupt-driven and reactive work, rather than strategy-driven and proactive work.
  • No enduring value: If the service is in the same state after you finish the task, with no lasting improvement, the task was probably toil.
  • O(n) with service growth: The work grows linearly with the service, e.g. every new feature needs another manual change to the CI/CD pipeline.
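
The checklist above can be turned into a small scoring helper. This is only a minimal sketch: the `ToilTraits` class, the `toil_score` function, and the two example tasks are made up for illustration and are not part of any SRE tooling. A higher score only means the task matches more of the traits, not that it is toil by some official threshold.

```python
from dataclasses import dataclass, fields


@dataclass
class ToilTraits:
    """One flag per trait from the list above (hypothetical helper)."""
    manual: bool
    repetitive: bool
    automatable: bool
    tactical: bool
    no_enduring_value: bool
    scales_with_service: bool


def toil_score(traits: ToilTraits) -> int:
    """Count how many of the six traits the task matches."""
    return sum(bool(getattr(traits, f.name)) for f in fields(traits))


# Illustrative tasks, not real data.
manual_restart = ToilTraits(True, True, True, True, True, False)
design_review = ToilTraits(True, False, False, False, False, False)

print(toil_score(manual_restart))  # 5 of 6 traits: very likely toil
print(toil_score(design_review))   # 1 of 6 traits: probably overhead, not toil
```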

Why does SRE say toil has to be kept low?

As we all know, eliminating toil 100% is not possible, but reducing it is. Here are some reasons why toil has to be kept low:

  1. The SRE organization has an advertised goal of keeping operational work below 50% of each SRE’s time. It means at least 50% of the time goes to actual project work. Rather than only trimming toil during operational tasks, we can remove more of it through the engineering done in that project time.
  2. Toil tends to expand if left unchecked and can quickly fill 100% of everyone’s time.
  3. When we hire new SREs, we promise them that SRE is not a typical Ops organization, quoting the 50% rule just mentioned. If all of their time goes to operational work, that is unfair to them, and they will start looking for new opportunities.
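
To make the 50% rule concrete, here is a minimal sketch of how a team could check its logged hours against the cap. The category names, the hours, and the `toil_fraction` helper are all assumptions for illustration; the only number taken from the text is the 50% limit.

```python
TOIL_CAP = 0.5  # the 50% rule quoted above


def toil_fraction(hours_by_category: dict, toil_categories: set) -> float:
    """Return the share of logged hours that falls into toil categories."""
    total = sum(hours_by_category.values())
    if total <= 0:
        raise ValueError("no hours logged")
    toil = sum(h for c, h in hours_by_category.items() if c in toil_categories)
    return toil / total


# Hypothetical week for one SRE (hours are made up).
week = {"interrupts": 10, "manual_deploys": 8, "tickets": 6, "project_work": 16}
fraction = toil_fraction(week, {"interrupts", "manual_deploys", "tickets"})
over_cap = fraction > TOIL_CAP

print(f"toil: {fraction:.0%}, over the cap: {over_cap}")
```

With these made-up numbers, 24 of 40 hours are toil, so this SRE is above the cap and should get more project time.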

How to calculate it?

Since everything in SRE is measured, how do we calculate toil?

As we said, SREs should spend less than 50% of their time on toil. A useful starting point is the time each SRE spends on call:

  In every 6-week cycle, each SRE spends at least 2 weeks on call.

Week 1: primary on-call -> the SRE is the first to respond to issues that need fixing.

Week 2: secondary on-call -> the SRE backs up the primary and picks up anything the primary misses.

There are 6 SREs in the on-call rotation in total. At any time, 2 different people with different duties are on call (primary and secondary), so they act as backup for each other.

If you need to understand “Being on call in detail” you can refer

So the calculation is:

The lower bound on toil is 2/6 = 33% of an SRE’s time.
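
The same arithmetic works for other rotation sizes. Below is a minimal sketch, assuming the cycle is as many weeks long as there are people in the rotation and each person takes one primary and one secondary week per cycle. The function name `on_call_toil_floor` is made up for this example.

```python
from fractions import Fraction


def on_call_toil_floor(rotation_size: int, on_call_weeks: int = 2) -> Fraction:
    """Share of each cycle one SRE spends on call (primary + secondary)."""
    if rotation_size < on_call_weeks:
        raise ValueError("rotation is too small for separate on-call weeks")
    return Fraction(on_call_weeks, rotation_size)


for size in (6, 8):
    floor = on_call_toil_floor(size)
    print(f"{size}-person rotation: {floor} = {float(floor):.0%}")
# 6-person rotation: 1/3 = 33%
# 8-person rotation: 1/4 = 25%
```

So a bigger rotation lowers the floor: with 8 people instead of 6, the on-call share drops from 33% to 25%.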

It means we have a lot of time left to improve the service and reduce toil: according to Google, the average time its SREs spend on toil is also about 33%, well under the 50% limit.

So the last question that arises in everyone’s mind is:

Is toil always bad?

Suppose a task takes just a few seconds to execute and gives quick wins, and someone likes doing it manually and is happy with it.

In that case, is toil still bad?

The answer is yes once there is too much of it, even if the person is happy with the toil and gets quick results. The reasons are as follows:

  1. Career stagnation: Your career progress will slow down or grind to a halt if you spend too little time on projects.
  2. Low morale: Toil within the limit is OK, but too much toil leads to burnout, boredom, and discontent.
  3. Creates confusion: SRE presents itself as an engineering organization. Individuals or teams within SRE that engage in too much toil undermine that message and confuse people about our role.
  4. Slows progress: Excessive toil makes a team less productive.
  5. Sets precedent: If you are willing to take on manual and repetitive work, other teams may start expecting SREs to take on such work, which is bad for obvious reasons.
  6. Causes breach of faith: New hires who joined with the promise of project work will feel cheated, which is bad for morale.
