Spark Broadcast Variables Simplified With Example

Reading Time: 3 minutes

Welcome back, everyone! Today we will learn about a new yet important concept in Apache Spark called broadcast variables. For new learners, I recommend starting with a Spark introduction blog.

What is a Broadcast Variable?

Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks.

Imagine you have some information, whether it is a variable, an RDD, a collection of objects, a large dataset, a database connection, or anything else, that you want to make available to all of your workers so that your executors can use it while processing data as part of executing a task. Broadcast variables are the mechanism that makes this possible.

Use Case

Let’s imagine I have a large table of zip codes/pin codes and want to perform transformations on that data for analysis.

Here, it is neither feasible to send the large lookup table to the executors every time, nor can we query the database every time. So, the solution is to convert this lookup table into a broadcast variable, and Spark will cache it on every executor for future reference.

This solves two main problems, namely network overhead and time consumption, as the sketch below shows.
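A minimal sketch of this pattern, where the loadZipTable() helper, its contents, and the zipCodesRdd input are hypothetical stand-ins for the real lookup table:

// Hypothetical helper that loads the zip-code lookup table once, on the driver
def loadZipTable(): Map[String, String] =
  Map("10001" -> "New York", "94105" -> "San Francisco")

// Broadcast it once; Spark caches a copy on every executor
val zipLookup = spark.sparkContext.broadcast(loadZipTable())

// Tasks read the cached copy instead of querying the database per record
val cities = zipCodesRdd.map(zip => zipLookup.value.getOrElse(zip, "unknown"))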

How Does it Work?

Here, in the above diagram, I created a broadcast variable named m in the driver with sc.broadcast(m). m is broadcast to all the workers, and the tasks running under these executors can access that variable whenever required.
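In code, this flow looks roughly like the sketch below (assuming an existing SparkContext sc and an RDD of keys named rdd):

val m = Map(1 -> "one", 2 -> "two")

// Driver side: register m with Spark and ship it to the workers once
val broadcastM = sc.broadcast(m)

// Executor side: each task reads the locally cached copy through .value
val described = rdd.map(key => broadcastM.value.getOrElse(key, "unknown"))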

The above diagram shows the internal working of the Broadcast Manager. BroadcastManager is a Spark service that manages broadcast variables in Spark. It is created for a Spark application when SparkContext is initialized and is a simple wrapper around BroadcastFactory.

ContextCleaner is a Spark service that is responsible for application-wide cleanup of shuffles, RDDs, broadcasts, and much more, with the aim of reducing the memory requirements of long-running, data-heavy Spark applications.

How to Use Broadcast Variables on RDD

This example defines commonly used data (countries and states) in Map variables, distributes the variables using SparkContext.broadcast(), and then uses them in an RDD map() transformation.


import org.apache.spark.sql.SparkSession

object RDDBroadcast extends App {

  val spark = SparkSession.builder()
    .appName("SparkByExamples.com")
    .master("local")
    .getOrCreate()

  // Commonly used lookup data, defined once on the driver
  val states = Map("NY" -> "New York", "CA" -> "California", "FL" -> "Florida")
  val countries = Map("USA" -> "United States of America", "IN" -> "India")

  // Distribute the lookup maps to every executor as read-only cached copies
  val broadcastStates = spark.sparkContext.broadcast(states)
  val broadcastCountries = spark.sparkContext.broadcast(countries)

  val data = Seq(("James","Smith","USA","CA"),
    ("Michael","Rose","USA","NY"),
    ("Robert","Williams","USA","CA"),
    ("Maria","Jones","USA","FL")
  )

  val rdd = spark.sparkContext.parallelize(data)

  val rdd2 = rdd.map(f => {
    val country = f._3
    val state = f._4
    // Tasks look up the full names in the cached copies via .value;
    // the maps are not shipped with every task
    val fullCountry = broadcastCountries.value(country)
    val fullState = broadcastStates.value(state)
    (f._1, f._2, fullCountry, fullState)
  })

  println(rdd2.collect().mkString("\n"))

}

The output will look like:

(James,Smith,United States of America,California)
(Michael,Rose,United States of America,New York)
(Robert,Williams,United States of America,California)
(Maria,Jones,United States of America,Florida)
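Broadcast variables work the same way with DataFrames. Here is a sketch following the same pattern, reusing the data, broadcastCountries, and broadcastStates values from the example above:

import spark.implicits._

val df = data.toDF("firstname", "lastname", "country", "state")

// Look up the full names inside a typed map over the DataFrame rows
val df2 = df.map(row => (
    row.getString(0),
    row.getString(1),
    broadcastCountries.value(row.getString(2)),
    broadcastStates.value(row.getString(3))
  )).toDF("firstname", "lastname", "country", "state")

df2.show(false)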

NOTE: When your work with a broadcast variable is complete, you should destroy it to release memory.

broadcast.destroy()
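Spark also provides unpersist(), which removes only the cached copies on the executors; the variable is re-broadcast if it is used again. A short sketch of the difference, reusing broadcastStates from the example above:

// Remove the cached copies from the executors only; the variable stays usable
// and is re-sent to the executors on its next use
broadcastStates.unpersist()

// Remove all data and metadata on both the driver and the executors;
// the variable cannot be used again afterwards
broadcastStates.destroy()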

Methods available in Broadcast class
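The key methods on org.apache.spark.broadcast.Broadcast are:

value – returns the broadcasted value
unpersist() – asynchronously deletes the cached copies of this broadcast on the executors
unpersist(blocking: Boolean) – deletes the cached copies on the executors, optionally blocking until deletion has completed
destroy() – destroys all data and metadata related to this broadcast variable; once destroyed, it cannot be used again
id – the unique identifier of this broadcast variable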

For more information about broadcast variables, refer to the official Spark documentation.

Conclusion

In this Spark broadcast variable blog, you have learned what a broadcast variable is, its benefits, and how to use it with RDDs and DataFrames with Scala examples. Stay tuned for upcoming blogs!

Written by 

Chitra Sapkal is a software consultant at Knoldus Inc. with 2 years of experience. Knoldus does Big Data product development on Scala, Spark, and Functional Java. She is a self-motivated, passionate person who is recognized as a good team player. Her hobbies include playing badminton and travelling.