Broadcast variables in Spark, how and when to use them?


As the Spark documentation states, broadcast variables are immutable shared variables that are cached on each worker node of a Spark cluster. In this blog, we demonstrate a simple use case for broadcast variables.

When to use Broadcast variables?

Consider the problem of counting the grammar elements in an arbitrary English paragraph, document or file. Suppose you have a Map that associates each word with its grammar element (for example, noun, verb or adjective).

Let us think of a function which returns the count of each grammar element for a given word.
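The dictionary and the function described above might look like the following minimal sketch. The three dictionary entries, the object name GrammarDictionary and the "unknown" fallback for words missing from the dictionary are assumptions made for illustration. The function returns a (grammar element, 1) pair so that the pairs can later be summed per element.

```scala
object GrammarDictionary {
  // Hypothetical dictionary: each word mapped to its grammar element.
  val dictionary: Map[String, String] =
    Map("man" -> "noun", "is" -> "verb", "mortal" -> "adjective")

  // Returns (grammar element, 1) for the word, or ("unknown", 1)
  // when the word is not in the dictionary (an assumed fallback).
  def getElementsCount(word: String, dictionary: Map[String, String]): (String, Int) =
    (dictionary.getOrElse(word, "unknown"), 1)
}
```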

We then use this function to count each grammar element in a small RDD of words.
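A sketch of that job, without a broadcast variable, could look like this. The word list is the one quoted in the reader comments on this post. The dictionary contents, the "unknown" fallback, the object and method names, and the local master setting are assumptions. Because the dictionary is a local value referenced inside the map function, it becomes part of the closure that Spark serializes for the tasks.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ClosureExample {
  // Assumed lookup logic: (grammar element, 1), or ("unknown", 1) for unmapped words.
  def getElementsCount(word: String, dictionary: Map[String, String]): (String, Int) =
    (dictionary.getOrElse(word, "unknown"), 1)

  def countGrammarElements(sc: SparkContext): Map[String, Int] = {
    // Hypothetical dictionary; a local val, so the lambda below captures it in its closure.
    val dictionary = Map("man" -> "noun", "is" -> "verb", "mortal" -> "adjective")
    val words = sc.parallelize(
      Array("man", "is", "mortal", "mortal", "1234", "789", "456", "is", "man"))
    // The raw dictionary travels with the task closure.
    words.map(word => getElementsCount(word, dictionary))
      .reduceByKey(_ + _)
      .collect()
      .toMap
  }

  def main(args: Array[String]): Unit = {
    // local[*] is only for trying the sketch on one machine.
    val sc = new SparkContext(
      new SparkConf().setAppName("closure-example").setMaster("local[*]"))
    try countGrammarElements(sc).foreach(println)
    finally sc.stop()
  }
}
```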

Before running each task on the available executors, Spark computes the task’s closure. The closure consists of the variables and methods that must be visible to the executor for it to perform its computations on the RDD.

In this approach, the dictionary is passed to the function as a plain value, so it becomes part of each task’s closure. This is fine as long as we run locally on a single executor. In a cluster environment, however, shipping the dictionary to every executor places a large communication and compute burden on Spark. Spark automatically broadcasts the common data needed by tasks within each stage. Data broadcast this way is cached in serialized form and deserialized before running each task.

Suppose we had a large English dictionary containing every possible word with its grammatical description. The cost would be even higher, because we send it as a raw value with the closures. As the documentation notes, explicitly creating broadcast variables is only useful when tasks across multiple stages need the same data or when caching the data in deserialized form is important.

How to create and use Broadcast variables?

Broadcast variables are wrappers around the value to be broadcast. More specifically, they are of type org.apache.spark.broadcast.Broadcast[T] and are created by calling the broadcast method of SparkContext with the value to wrap.

The variable broadCastDictionary, which holds the broadcast dictionary, will be sent to each node only once. The wrapped value can be accessed by calling .value on the broadcast variable. Let us make a small change to our method getElementsCount so that it takes the broadcast variable instead of the raw dictionary.

Instead of sending the raw dictionary, we pass broadCastDictionary when mapping over the words RDD.

Calling collect on the resulting RDD returns the count of each grammar element.
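Putting the steps of this section together, a minimal sketch might look as follows. The names broadCastDictionary and getElementsCount follow the text. The dictionary entries, the "unknown" fallback, the object name, the countGrammarElements helper and the local master setting are assumptions made for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.broadcast.Broadcast

object BroadcastExample {
  // Reads the dictionary through the broadcast wrapper via .value.
  // Unmapped words fall back to "unknown" (an assumed convention).
  def getElementsCount(word: String,
                       dictionary: Broadcast[Map[String, String]]): (String, Int) =
    (dictionary.value.getOrElse(word, "unknown"), 1)

  def countGrammarElements(sc: SparkContext): Map[String, Int] = {
    // Hypothetical dictionary contents.
    val dictionary = Map("man" -> "noun", "is" -> "verb", "mortal" -> "adjective")
    // Wrap the dictionary in a broadcast variable.
    val broadCastDictionary: Broadcast[Map[String, String]] = sc.broadcast(dictionary)
    val words = sc.parallelize(
      Array("man", "is", "mortal", "mortal", "1234", "789", "456", "is", "man"))
    // The closure now captures only the broadcast handle, not the raw Map.
    words.map(word => getElementsCount(word, broadCastDictionary))
      .reduceByKey(_ + _)
      .collect()
      .toMap
  }

  def main(args: Array[String]): Unit = {
    // local[*] is only for trying the sketch on one machine.
    val sc = new SparkContext(
      new SparkConf().setAppName("broadcast-example").setMaster("local[*]"))
    try countGrammarElements(sc).foreach(println)
    finally sc.stop()
  }
}
```

With this assumed dictionary, the counts are 2 each for noun, verb and adjective, and 3 for unknown. The order in which collect returns the pairs is not guaranteed.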

Things to remember while using Broadcast variables:

Once we have broadcast a value to the nodes, we should not modify it, so that every node has exactly the same copy of the data. Otherwise, the modified value might later be sent to another node, which would give unexpected results.

References:

http://spark.apache.org/docs/latest/programming-guide.html#broadcast-variables

Learning Spark, by Holden Karau, Andy Konwinski, Patrick Wendell and Matei Zaharia

Written by 

Manish Mishra is a Lead Software Consultant with more than 7 years of experience. His primary development technology was Java. He fell for the Scala language and finds it innovative, interesting and fun to code with. He has also co-authored a journal paper titled "Economy Driven Real Time Deadline Based Scheduling". His interests include learning cloud computing products and technologies, and algorithm design. He finds books and literature to be his favorite companions in solitude. His favorites are stories, spiritual fiction and time-travel fiction.
