Tuning apache spark application with speculation

What happen if spark job will be slow its a big question for application performance so we can optimize the jobs in spark with speculation, Its basically start a copy of job in another worker if the existing job is slow.It will not stop the slow execution of job both the workers execute the job simultaneously.

To make our job speculative we need to set configuration as one of follows :-

we can set speculation configuration in our code as follows

val conf=new SparkConf().set("spark.speculation","false")

Or, We can add configuration dynamically at time for deploy application with the help of flag as follows

--conf spark.speculation=false

Or, We can also add it in properties file with the space delimiter as follows

spark.speculation false

Now its time to deep dive, with speculation we have lot of questions like:
It will produce redundant result?
In what time speculation will happen?
If most of our job were completed then its reliable to speculate job?

So, firstly actions in spark are always idempotent, spark handle it for us that even task will be executed for multiple times but only add it in result once.Spark processes the result generated from each partition by passing them to resultHandler and it is always equal to partition.size

We also need to define how often the task will be check for the speculation for that we can define configuration as

spark.speculation.interval (time in milliseconds)

By default it is 100 ms

The speculation take place at the end of the stages is not make sense so for that we can also set configuration that speculation the stage only if particular percentage of tasks were not completed. For that we need to set configuration as follow

spark.speculation.quantile (value in percentage)

By default its value is 0.75

Last last we can set configuration to set at how much level the tasks will slow will be speculate for that we define the configuration

spark.speculation.multiplier

By default its value is 1.5 so if task is 1.5 times slower than it will be consider for the speculation.

4 thoughts on “Tuning apache spark application with speculation

  1. About “spark.speculation.quantile” configuration, i think it should be “we can also set configuration that speculation the stage only if particular percentage of tasks were completed. For that we need to set configuration as follow”, instead of “particular percentage of tasks were NOT completed”

Leave a Reply

%d bloggers like this: