What happens if a Spark job runs slow? That is a big question for application performance, and Spark lets us address it with speculation: if a task is running slowly, Spark starts a copy of that task on another worker. It does not stop the slow task; both workers run the task simultaneously, and Spark uses the result of whichever copy finishes first.
To make our job speculative, we need to set the configuration in one of the following ways:
We can enable speculation in our code as follows (note the value must be "true"; "false" is the default and disables speculation):
val conf = new SparkConf().set("spark.speculation", "true")
Or we can add the configuration dynamically at deploy time with a flag.
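For example, assuming the application is deployed with spark-submit, the flag is passed as a --conf option (the class and jar names here are hypothetical placeholders):

```
spark-submit \
  --class com.example.MyApp \
  --conf spark.speculation=true \
  my-app.jar
```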
Or we can add it to the properties file, with the key and value separated by whitespace.
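For example, in conf/spark-defaults.conf the property name and its value are whitespace-delimited, one property per line:

```
spark.speculation           true
spark.speculation.interval  100ms
```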
Now it's time to deep dive. With speculation we have a lot of questions, like:
Will it produce redundant results?
When does speculation actually happen?
If most of the tasks have already completed, is it still sensible to speculate?
So, firstly, actions in Spark are always idempotent. Spark handles this for us: even if a task is executed multiple times, its output is added to the result only once. Spark processes the result generated from each partition by passing it to resultHandler, and the number of results is always equal to the number of partitions.
We also need to define how often tasks will be checked for speculation. For that we can set the configuration
spark.speculation.interval (a time value)
By default it is 100 ms.
Speculating right at the start of a stage does not make sense, because Spark has no baseline to compare against. So we can also configure Spark to start speculating only after a particular fraction of a stage's tasks have completed. For that we set the configuration
spark.speculation.quantile (a fraction between 0 and 1)
By default its value is 0.75, i.e. speculation kicks in only once 75% of the tasks in a stage have finished.
Last, we can configure how much slower a task must be before it is speculated. For that we define the configuration
spark.speculation.multiplier
By default its value is 1.5, so a task running 1.5 times slower than the median task duration will be considered for speculation.
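Putting it all together, a minimal sketch of the four settings on a SparkConf (the values shown are simply Spark's defaults spelled out explicitly; the app name is a hypothetical placeholder):

```scala
import org.apache.spark.SparkConf

// Enable speculative execution and tune when and how aggressively it kicks in.
val conf = new SparkConf()
  .setAppName("speculation-demo")
  .set("spark.speculation", "true")           // turn speculative execution on
  .set("spark.speculation.interval", "100ms") // how often to check for slow tasks
  .set("spark.speculation.quantile", "0.75")  // fraction of tasks that must finish first
  .set("spark.speculation.multiplier", "1.5") // how many times slower than the median
```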