Tuning an Apache Spark application with speculation


What happens when a Spark job runs slowly? It is a big question for application performance. We can optimize Spark jobs with speculation: Spark starts a copy of a slow task on another worker. It does not stop the slow execution; both copies run simultaneously, and Spark uses the result of whichever finishes first.

To make our job speculative, we need to set the configuration in one of the following ways:

We can set the speculation configuration in our code as follows:

import org.apache.spark.SparkConf
val conf = new SparkConf().set("spark.speculation", "true")

Or, we can add the configuration dynamically when deploying the application, with the help of a flag as follows:

--conf spark.speculation=true

Or, we can also add it to the properties file, with a space as the delimiter, as follows:

spark.speculation true

Now it's time for a deep dive. With speculation we have a lot of questions, such as:
Will it produce redundant results?
When will speculation happen?
If most of our job has already completed, is it still worthwhile to speculate?

So, firstly, actions in Spark are idempotent as far as their result is concerned: Spark handles it for us, so even if a task is executed multiple times, its result is added only once. Spark processes the result generated for each partition by passing it to resultHandler, and the number of results handled always equals the number of partitions (partition.size).
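The "add each partition's result only once" behaviour can be pictured with a small plain-Scala model. This is only a sketch under stated assumptions, not Spark's scheduler code: `TaskResult`, `ResultCollector`, and `attemptId` are hypothetical names invented for illustration. The collector accepts the first result that arrives for a partition and ignores any later attempt for the same partition.

```scala
import scala.collection.mutable

// Hypothetical model of a finished task attempt (not a Spark class).
final case class TaskResult(partition: Int, attemptId: Int, value: Long)

// Keeps the first result per partition; later attempts are dropped.
final class ResultCollector(numPartitions: Int) {
  private val results = mutable.Map.empty[Int, Long]

  // Returns true if the result was accepted, false if it was a duplicate.
  def offer(r: TaskResult): Boolean =
    if (results.contains(r.partition)) false
    else { results(r.partition) = r.value; true }

  // The job is done once every partition has exactly one result.
  def isComplete: Boolean = results.size == numPartitions

  def total: Long = results.values.sum
}

object ResultCollectorDemo {
  def main(args: Array[String]): Unit = {
    val collector = new ResultCollector(numPartitions = 2)
    collector.offer(TaskResult(partition = 0, attemptId = 0, value = 10L))
    collector.offer(TaskResult(partition = 1, attemptId = 1, value = 5L)) // speculative copy finishes first
    collector.offer(TaskResult(partition = 1, attemptId = 0, value = 5L)) // slow original, ignored
    println(s"complete=${collector.isComplete}, total=${collector.total}")
  }
}
```

Note that this only models the result returned to the driver. Side effects performed inside a task, such as writes to an external system, are outside what this sketch covers.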

We also need to define how often Spark checks the tasks for speculation. For that we can set the following configuration:

spark.speculation.interval (time in milliseconds)

By default it is 100 ms

Speculation only makes sense once enough tasks of a stage have finished to judge what "slow" means, so we can also configure Spark to speculate within a stage only after a particular fraction of its tasks have completed. For that we need to set the following configuration:

spark.speculation.quantile (value as a fraction between 0 and 1)

By default its value is 0.75, that is, 75% of the tasks.
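As a worked example of the quantile, the small sketch below computes how many tasks of a stage must finish before speculation is considered. It assumes the count is simply the floor of quantile times the number of tasks; `QuantileGate` and `minFinished` are hypothetical names, and Spark's real scheduler may round or clamp differently.

```scala
object QuantileGate {
  // Number of tasks that must finish before slow tasks are considered,
  // assuming the gate is floor(quantile * numTasks).
  def minFinished(numTasks: Int, quantile: Double): Int =
    math.floor(quantile * numTasks).toInt

  def main(args: Array[String]): Unit = {
    val numTasks = 200
    val needed = minFinished(numTasks, quantile = 0.75)
    println(s"speculation is considered after $needed of $numTasks tasks finish")
  }
}
```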

Lastly, we can configure how slow a task must be before it is speculated. For that we define the configuration

spark.speculation.multiplier

By default its value is 1.5, so a task that is 1.5 times slower than the median task is considered for speculation.
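To see how the quantile and the multiplier work together, here is a simplified plain-Scala model of the decision. It is a sketch, not Spark's actual scheduler code: `SpeculationSketch`, `median`, and `candidates` are hypothetical names, durations are assumed to be in milliseconds, and the quantile gate is assumed to be the floor of quantile times the number of tasks.

```scala
object SpeculationSketch {
  // Median of the finished task durations (milliseconds).
  def median(durations: Seq[Long]): Double = {
    val sorted = durations.sorted
    val n = sorted.length
    if (n % 2 == 1) sorted(n / 2).toDouble
    else (sorted(n / 2 - 1) + sorted(n / 2)) / 2.0
  }

  // Returns the ids of running tasks that would be candidates for speculation.
  // finished: durations of completed tasks; running: task id -> elapsed time so far.
  def candidates(
      numTasks: Int,
      finished: Seq[Long],
      running: Map[Int, Long],
      quantile: Double = 0.75,
      multiplier: Double = 1.5): Set[Int] = {
    // Gate: enough tasks must have finished before anything is speculated.
    val enoughFinished =
      finished.nonEmpty && finished.length >= math.floor(quantile * numTasks).toInt
    if (!enoughFinished) Set.empty
    else {
      // A running task is "slow" once it exceeds multiplier * median duration.
      val threshold = multiplier * median(finished)
      running.collect { case (id, elapsed) if elapsed > threshold => id }.toSet
    }
  }

  def main(args: Array[String]): Unit = {
    // 4 tasks in the stage: 3 finished in about 110 ms, task 3 has run for 400 ms.
    val slow = candidates(
      numTasks = 4,
      finished = Seq(100L, 110L, 120L),
      running = Map(3 -> 400L))
    println(s"tasks to speculate: $slow")
  }
}
```

In this example the median is 110 ms, so the threshold is 165 ms and the task that has been running for 400 ms is selected. With only two of the four tasks finished, the quantile gate would not be met and nothing would be speculated.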

About sandeep

I am working as a software consultant at Knoldus Software LLP. I work on Scala, Play, Spark, Hive, HDFS, Hadoop and many other big data technologies.

4 Responses to Tuning apache spark application with speculation

  1. Yu-Jhe says:

    About the “spark.speculation.quantile” configuration, I think it should be “we can also set configuration that speculation the stage only if particular percentage of tasks were completed. For that we need to set configuration as follow”, instead of “particular percentage of tasks were NOT completed”.
