
Spark SQL
Spark SQL is a module in Apache Spark for processing structured data using SQL syntax. It integrates seamlessly with the rest of the Spark ecosystem, including Spark Streaming and MLlib.
One of the main benefits of Spark SQL is that it lets users mix SQL queries with the programming language of their choice, such as Scala, Python, or Java. This makes it easy for data engineers to work with structured data in Spark, because they can use the same SQL syntax they would use against a traditional relational database.
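For example, here is a minimal sketch of that mix in Scala (the view name "people" and the sample data are illustrative, not part of the examples below):
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().appName("SqlExample").getOrCreate()
import spark.implicits._
// Register a DataFrame as a temporary view so SQL can see it
val peopleDF = Seq((1, "alice"), (2, "bob")).toDF("id", "name")
peopleDF.createOrReplaceTempView("people")
// Run an ordinary SQL query and get a DataFrame back
spark.sql("SELECT id, name FROM people WHERE id > 1").show()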
Types of Joins in Spark SQL
In Apache Spark, we can use the following types of joins in SQL:
Inner join: An inner join in Apache Spark is a type of join that returns only the rows that match a given predicate in both tables. To perform an inner join in Spark using Scala, we can use the join method on a DataFrame:
val innerJoinDF = leftDF.join(rightDF, leftDF("key") === rightDF("key"), "inner")
Left outer join: A left outer join in Apache Spark is a type of join that returns all the rows from the left DataFrame, as well as any matching rows from the right DataFrame. If there is no match, the right-side columns will be filled with null values. To perform a left outer join in Spark using Scala, we can use the join method on a DataFrame:
val leftOuterJoinDF = leftDF.join(rightDF, leftDF("key") === rightDF("key"), "left_outer")
Right outer join: A right outer join returns all the rows from the right table, and any matching rows from the left table. If there is no match, null values are returned for the left table’s columns.
val rightOuterJoinDF = leftDF.join(rightDF, leftDF("key") === rightDF("key"), "right_outer")
Full outer join: A full outer join returns all the rows from both tables, and null values are returned for any missing matches on either side.
val fullOuterJoinDF = leftDF.join(rightDF, leftDF("key") === rightDF("key"), "full_outer")
Cross join: A cross join in Apache Spark is a type of join that returns the cross product of two tables, i.e., every possible combination of rows from both tables. To perform a cross join in Spark using Scala, we can use the crossJoin method on a DataFrame:
val crossJoinDF = leftDF.crossJoin(rightDF)
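Each of these DataFrame joins also has a plain-SQL counterpart once the DataFrames are registered as temporary views. A sketch, assuming leftDF and rightDF with a common key column as in the snippets above:
// Expose both DataFrames to SQL under short view names
leftDF.createOrReplaceTempView("l")
rightDF.createOrReplaceTempView("r")
spark.sql("SELECT * FROM l INNER JOIN r ON l.key = r.key")
spark.sql("SELECT * FROM l LEFT OUTER JOIN r ON l.key = r.key")
spark.sql("SELECT * FROM l RIGHT OUTER JOIN r ON l.key = r.key")
spark.sql("SELECT * FROM l FULL OUTER JOIN r ON l.key = r.key")
spark.sql("SELECT * FROM l CROSS JOIN r")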
Follow Along With the Code
Let’s understand these with some detailed examples. First, we create a SparkSession and the two DataFrames that we will use to explore Spark joins:
import org.apache.spark.sql.SparkSession
// Create a SparkSession
val spark = SparkSession.builder().appName("Joins").getOrCreate()
// Needed for the Seq(...).toDF(...) calls used later
import spark.implicits._
// Set the log level to ERROR
spark.sparkContext.setLogLevel("ERROR")
// Create the left and right DataFrames
val leftDF = spark.createDataFrame(Seq((1, "x"), (2, "y"), (3, "z"))).toDF("id", "name")
val rightDF = spark.createDataFrame(Seq((1, "a"), (2, "b"), (3, "c"))).toDF("id", "state")
Inner join
val innerJoinDF = leftDF.join(rightDF, leftDF("id") === rightDF("id"), "inner")
innerJoinDF.show()
This code will output the following:
+---+----+---+-----+
| id|name| id|state|
+---+----+---+-----+
|  1|   x|  1|    a|
|  2|   y|  2|    b|
|  3|   z|  3|    c|
+---+----+---+-----+
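Notice that the expression join keeps both id columns. If you want a single key column, here are two common options (a sketch using the same DataFrames):
// Option 1: join on the column name; Spark deduplicates the key column
leftDF.join(rightDF, Seq("id"), "inner").show()
// Option 2: keep the expression join but drop the right-hand key afterwards
innerJoinDF.drop(rightDF("id")).show()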
Left outer join
val leftOuterJoinDF = leftDF.join(rightDF, leftDF("id") === rightDF("id"), "left_outer")
leftOuterJoinDF.show()
This code will output the following:
+---+----+---+-----+
| id|name| id|state|
+---+----+---+-----+
|  1|   x|  1|    a|
|  2|   y|  2|    b|
|  3|   z|  3|    c|
+---+----+---+-----+
Now suppose we have these new DataFrames, where df2 has no row for id 3:
val df1 = Seq((1, "x"), (2, "y"), (3, "z")).toDF("id", "name")
val df2 = Seq((1, "a"), (2, "b")).toDF("id", "city")
val df3 = df1.join(df2, Seq("id"), "left_outer")
df3.show()
This gives the output shown below, with one null value:
+---+----+----+
| id|name|city|
+---+----+----+
|  1|   x|   a|
|  2|   y|   b|
|  3|   z|null|
+---+----+----+
Right outer join
val rightOuterJoinDF = leftDF.join(rightDF, leftDF("id") === rightDF("id"), "right_outer")
rightOuterJoinDF.show()
This code will output the following:
+---+----+---+-----+
| id|name| id|state|
+---+----+---+-----+
|  1|   x|  1|    a|
|  2|   y|  2|    b|
|  3|   z|  3|    c|
+---+----+---+-----+
Again, suppose we have these new DataFrames, this time with no row for id 3 in df1:
val df1 = Seq((1, "x"), (2, "y")).toDF("id", "name")
val df2 = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "city")
val df3 = df1.join(df2, Seq("id"), "right_outer")
df3.show()
This gives the output shown below, with one null value:
+---+----+----+
| id|name|city|
+---+----+----+
|  1|   x|   a|
|  2|   y|   b|
|  3|null|   c|
+---+----+----+
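If the nulls introduced by an outer join get in the way of downstream logic, one option is to replace them with a default value. A minimal sketch, assuming "unknown" is an acceptable placeholder:
// Replace nulls in the name column with a default string
df3.na.fill("unknown", Seq("name")).show()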
Full outer join
val fullOuterJoinDF = leftDF.join(rightDF, leftDF("id") === rightDF("id"), "full_outer")
fullOuterJoinDF.show()
This code will output the following:
+---+----+---+-----+
| id|name| id|state|
+---+----+---+-----+
|  1|   x|  1|    a|
|  2|   y|  2|    b|
|  3|   z|  3|    c|
+---+----+---+-----+
Now suppose we have these new DataFrames, where each side has an id that the other lacks:
val df1 = Seq((1, "x"), (2, "y"), (4, "z")).toDF("id", "name")
val df2 = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "city")
val df3 = df1.join(df2, df1("id") === df2("id"), "full_outer")
df3.show()
This code will output the following; the unmatched rows from both sides appear, padded with nulls:
+----+----+----+----+
|  id|name|  id|city|
+----+----+----+----+
|   1|   x|   1|   a|
|   2|   y|   2|   b|
|null|null|   3|   c|
|   4|   z|null|null|
+----+----+----+----+
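Because the expression-based full outer join keeps both id columns, a common follow-up is to merge them into one: in every row, at least one side’s id is non-null. A sketch using coalesce:
import org.apache.spark.sql.functions.coalesce
// Take whichever id is non-null and keep one copy of each column
val merged = df3.select(coalesce(df1("id"), df2("id")).as("id"), df1("name"), df2("city"))
merged.show()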
Cross join
val crossJoinDF = leftDF.crossJoin(rightDF)
crossJoinDF.show()
This code will output the following:
+---+----+---+-----+
| id|name| id|state|
+---+----+---+-----+
|  1|   x|  1|    a|
|  1|   x|  2|    b|
|  1|   x|  3|    c|
|  2|   y|  1|    a|
|  2|   y|  2|    b|
|  2|   y|  3|    c|
|  3|   z|  1|    a|
|  3|   z|  2|    b|
|  3|   z|  3|    c|
+---+----+---+-----+
Similarly, suppose we have these new DataFrames:
val df1 = Seq((1, "x"), (2, "y"), (3, "z")).toDF("id", "name")
val df2 = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "city")
val df3 = df1.crossJoin(df2)
df3.show()
This code will output the following:
+---+----+---+----+
| id|name| id|city|
+---+----+---+----+
|  1|   x|  1|   a|
|  1|   x|  2|   b|
|  1|   x|  3|   c|
|  2|   y|  1|   a|
|  2|   y|  2|   b|
|  2|   y|  3|   c|
|  3|   z|  1|   a|
|  3|   z|  2|   b|
|  3|   z|  3|   c|
+---+----+---+----+
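Cross joins grow multiplicatively: 3 × 3 = 9 rows here, but n × m rows in general, so it is worth sanity-checking sizes before running one on large tables. A quick check on the example above:
// The row count of a cross join is the product of the input counts
assert(df3.count() == df1.count() * df2.count()) // 9 == 3 * 3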
Conclusion
This blog is only a simple introduction for learning purposes, but these joins turn up in plenty of useful places. For more details, visit SparkByExamples. We will write more blogs on the same topic in the future and look at how to use this technology to write SQL queries for more complicated cases.
If you want to add anything, or you disagree with my view on any point, drop me a comment. I will be happy to discuss it. For more blogs, click here.
