
Welcome back, everyone! Today we will learn about a new yet important concept in Apache Spark called broadcast variables. For new learners, I recommend starting with a Spark introduction blog.
What is a Broadcast Variable?
Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks.
Imagine you have some information (a variable, an RDD, a collection of objects, a large dataset, a database connection, or anything else) that you want to make available to all of your workers, so that your executors can use it while processing data as part of a task. That is exactly what broadcast variables do.
Use Case
Let’s imagine I have a large table of zip codes/pin codes and want to perform transformations on that data for analysis.
Here, it is neither feasible to send the large lookup table to the executors every time, nor can we query the database for every record. The solution is to convert this lookup table into a broadcast variable, and Spark will cache it on every executor for future reference.
This solves two main problems: network overhead and time consumption.
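As a minimal sketch of this use case (the object name, app name, tiny lookup map, and sample records below are made-up placeholders standing in for the real zip-code table):

```scala
import org.apache.spark.sql.SparkSession

object ZipLookupBroadcast extends App {
  val spark = SparkSession.builder()
    .appName("ZipLookupBroadcast") // hypothetical app name
    .master("local")
    .getOrCreate()

  // Small in-memory map standing in for the large zip-code lookup table
  val zipLookup = Map("10001" -> "New York", "94105" -> "San Francisco")

  // Broadcast once; Spark caches the map on every executor
  val broadcastZip = spark.sparkContext.broadcast(zipLookup)

  val records = spark.sparkContext.parallelize(
    Seq(("Alice", "10001"), ("Bob", "94105")))

  // Each task reads the locally cached copy instead of receiving
  // the map again with every task
  val withCity = records.map { case (name, zip) =>
    (name, broadcastZip.value.getOrElse(zip, "Unknown"))
  }

  withCity.collect().foreach(println)
  spark.stop()
}
```

Because the map is broadcast once per executor rather than serialized into every task closure, the same pattern scales even when the lookup table is large.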
How it Works?



Here, in the above diagram, I created a broadcast variable named m in the driver with sc.broadcast(m). m is broadcast to all the workers, and tasks running under those executors can access the variable whenever required.
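The step shown in the diagram can be sketched as follows (assuming sc is an active SparkContext; the map contents here are just sample values):

```scala
// Assuming `sc` is an active SparkContext
val m = Map("a" -> 1, "b" -> 2)

// The driver ships m to each executor once, not once per task
val bm = sc.broadcast(m)

// Inside any task running on an executor:
val v = bm.value("a") // reads the locally cached copy
```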



The above diagram shows the internal working of the Broadcast Manager. BroadcastManager is a Spark service that manages broadcast variables. It is created for a Spark application when SparkContext is initialized and is a simple wrapper around BroadcastFactory.
ContextCleaner is a Spark service that is responsible for application-wide cleanup of shuffles, RDDs, broadcasts, and more, with the aim of reducing the memory requirements of long-running, data-heavy Spark applications.
How to use broadcast variables on RDD
This example defines commonly used data (countries and states) in Map variables, distributes the variables using SparkContext.broadcast(), and then uses them in an RDD map() transformation.
import org.apache.spark.sql.SparkSession

object RDDBroadcast extends App {

  val spark = SparkSession.builder()
    .appName("SparkByExamples.com")
    .master("local")
    .getOrCreate()

  val states = Map("NY" -> "New York", "CA" -> "California", "FL" -> "Florida")
  val countries = Map("USA" -> "United States of America", "IN" -> "India")

  // Broadcast the lookup maps once; every executor caches its own copy
  val broadcastStates = spark.sparkContext.broadcast(states)
  val broadcastCountries = spark.sparkContext.broadcast(countries)

  val data = Seq(
    ("James", "Smith", "USA", "CA"),
    ("Michael", "Rose", "USA", "NY"),
    ("Robert", "Williams", "USA", "CA"),
    ("Maria", "Jones", "USA", "FL")
  )

  val rdd = spark.sparkContext.parallelize(data)

  // Each task resolves the codes through the cached broadcast values
  val rdd2 = rdd.map { f =>
    val country = f._3
    val state = f._4
    val fullCountry = broadcastCountries.value(country)
    val fullState = broadcastStates.value(state)
    (f._1, f._2, fullCountry, fullState)
  }

  println(rdd2.collect().mkString("\n"))
}
The output will look like

(James,Smith,United States of America,California)
(Michael,Rose,United States of America,New York)
(Robert,Williams,United States of America,California)
(Maria,Jones,United States of America,Florida)
NOTE: When your work with a broadcast variable is complete, you should destroy it to release memory.
broadcast.destroy()
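The difference between unpersist() and destroy() can be sketched as follows (assuming spark is an active SparkSession): unpersist() only drops the cached copies on the executors, and the value is re-broadcast the next time it is used, while destroy() removes all data and metadata, after which the variable must not be used again.

```scala
val bc = spark.sparkContext.broadcast(Map("CA" -> "California"))

println(bc.value("CA"))  // normal use

bc.unpersist()           // drop executor-side cached copies only
println(bc.value("CA"))  // still usable; Spark re-broadcasts on demand

bc.destroy()             // remove all data and metadata
// bc.value would now throw an exception; the variable is gone for good
```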
Methods available in the Broadcast class

value - returns the broadcast value
unpersist() - asynchronously deletes the cached copies of this broadcast on the executors
unpersist(blocking) - deletes the cached copies, optionally blocking until deletion completes
destroy() - destroys all data and metadata related to the broadcast variable; it cannot be used again afterwards
id - the unique identifier of the broadcast variable
For more information about broadcast variables, refer to this link.
Conclusion
In this Spark broadcast variable blog, you have learned what a broadcast variable is, its benefits, and how to use it on RDD and DataFrame with a Scala example. Stay tuned for upcoming blogs!


