Joins in Spark SQL with examples


Spark SQL

Spark SQL is a module in Apache Spark. It allows users to process structured data using a SQL-like syntax. It integrates seamlessly with the Spark ecosystem, including Spark Streaming and MLlib.

One of the main benefits of Spark SQL is that it lets users combine SQL queries with the programming language of their choice, such as Scala, Python, or Java. This makes it easy for data engineers to work with structured data in Spark, since they can use the same SQL syntax they would use in a traditional relational database.
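As a quick illustration of this mixing of SQL and a host language, here is a minimal sketch in Scala: a DataFrame is registered as a temporary view and then queried with plain SQL. The names (`people`, the `SqlExample` app name) and the `local[*]` master are illustrative choices for a standalone run, not from any particular setup.

```scala
import org.apache.spark.sql.SparkSession

// Start a local SparkSession (local[*] is just for a quick standalone run).
val spark = SparkSession.builder()
  .appName("SqlExample")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Build a small DataFrame and expose it to SQL as a temporary view.
val people = Seq((1, "x"), (2, "y")).toDF("id", "name")
people.createOrReplaceTempView("people")

// Query the view with ordinary SQL from Scala.
val result = spark.sql("SELECT id, name FROM people WHERE id > 1")
result.show()
```

The same query could of course be written with the DataFrame API; the point is that both styles coexist in one program.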

Types of Joins in Spark SQL

In Apache Spark, we can use the following types of joins in SQL:

Inner join: An inner join in Apache Spark returns only the rows from both tables that satisfy the join condition. To perform an inner join in Spark using Scala, we can use the join method on a DataFrame.

val innerJoinDF = leftDF.join(rightDF, leftDF("key") === rightDF("key"), "inner")

Left outer join: A left outer join in Apache Spark is a type of join that returns all the rows from the left DataFrame, as well as any matching rows from the right DataFrame. If there is no match, the right-side columns will be filled with null values. To perform a left outer join in Spark using Scala, we can use the join method on a DataFrame.

val leftOuterJoinDF = leftDF.join(rightDF, leftDF("key") === rightDF("key"), "left_outer")

Right outer join: A right outer join returns all the rows from the right table and any matching rows from the left table. If there is no match, null values are returned for the left table’s columns.

val rightOuterJoinDF = leftDF.join(rightDF, leftDF("key") === rightDF("key"), "right_outer")

Full outer join: Full outer join returns all rows from both tables, and null values are returned for any missing matches on either side.

val fullOuterJoinDF = leftDF.join(rightDF, leftDF("key") === rightDF("key"), "full_outer")

Cross join: A cross join in Apache Spark is a type of join that returns the cross product of two tables, i.e., every possible combination of rows from both tables. To perform a cross join in Spark using Scala, we can use the crossJoin method on a DataFrame.

val crossJoinDF = leftDF.crossJoin(rightDF)

Follow Along with Code

Let’s understand these with some detailed examples. First we create a SparkSession and the two DataFrames that we will use to explore Spark joins:

import org.apache.spark.sql.SparkSession

// Create a SparkSession

val spark = SparkSession.builder().appName("Joins").getOrCreate()

// Import implicits so that Seq(...).toDF(...) works in the later examples
import spark.implicits._

// Set the log level to ERROR

spark.sparkContext.setLogLevel("ERROR")

// Create the left and right DataFrames

val leftDF = spark.createDataFrame(Seq((1, "x"), (2, "y"), (3, "z"))).toDF("id", "name")

val rightDF = spark.createDataFrame(Seq((1, "a"), (2, "b"), (3, "c"))).toDF("id", "state")

Inner join

val innerJoinDF = leftDF.join(rightDF, leftDF("id") === rightDF("id"), "inner")

innerJoinDF.show()

This code will output the following:

+---+-------+---+-----+
| id|   name| id|state|
+---+-------+---+-----+
|  1|      x|  1|    a|
|  2|      y|  2|    b|
|  3|      z|  3|    c|
+---+-------+---+-----+
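Notice that joining with an explicit column expression keeps both id columns in the result. When that is unwanted, Spark's join method also accepts a sequence of column names, which emits the key only once. A small self-contained sketch (the `JoinByName` app name and `local[*]` master are illustrative):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("JoinByName").master("local[*]").getOrCreate()

val leftDF = spark.createDataFrame(Seq((1, "x"), (2, "y"), (3, "z"))).toDF("id", "name")
val rightDF = spark.createDataFrame(Seq((1, "a"), (2, "b"), (3, "c"))).toDF("id", "state")

// Passing Seq("id") de-duplicates the key: the schema is (id, name, state).
val innerJoinByName = leftDF.join(rightDF, Seq("id"), "inner")
innerJoinByName.show()
```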

Left outer join

val leftOuterJoinDF = leftDF.join(rightDF, leftDF("id") === rightDF("id"), "left_outer")

leftOuterJoinDF.show()

This code will output the following:

+---+-------+---+-----+
| id|   name| id|state|
+---+-------+---+-----+
|  1|      x|  1|    a|
|  2|      y|  2|    b|
|  3|      z|  3|    c|
+---+-------+---+-----+

Now, for example, suppose we have these new DataFrames:

val df1 = Seq((1, "x"), (2, "y"), (3, "z")).toDF("id", "name")

val df2 = Seq((1, "a"), (2, "b")).toDF("id", "city")

val df3 = df1.join(df2, Seq("id"), "left_outer")

df3.show()

This will give output as shown below with one null value –

+---+-------+----+
| id|   name|city|
+---+-------+----+
|  1|      x|   a|
|  2|      y|   b|
|  3|      z|null|
+---+-------+----+
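A common use of a left outer join is spotting the left-side rows that have no match on the right, by filtering on the null columns the join produced. A short sketch with the same shape of data (Spark also has a dedicated "left_anti" join type for exactly this):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("UnmatchedRows").master("local[*]").getOrCreate()
import spark.implicits._

val df1 = Seq((1, "x"), (2, "y"), (3, "z")).toDF("id", "name")
val df2 = Seq((1, "a"), (2, "b")).toDF("id", "city")

// Rows whose right-side columns came back null are the unmatched ones.
val unmatched = df1.join(df2, Seq("id"), "left_outer").filter(col("city").isNull)
unmatched.show()
```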

Right outer join

val rightOuterJoinDF = leftDF.join(rightDF, leftDF("id") === rightDF("id"), "right_outer")

rightOuterJoinDF.show()

This code will output the following:

+---+-------+---+-----+
| id|   name| id|state|
+---+-------+---+-----+
|  1|      x|  1|    a|
|  2|      y|  2|    b|
|  3|      z|  3|    c|
+---+-------+---+-----+

Now, for example, suppose we have these new DataFrames:

val df1 = Seq((1, "x"), (2, "y")).toDF("id", "name")

val df2 = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "city")

val df3 = df1.join(df2, Seq("id"), "right_outer")

df3.show()

This will give output as shown below with one null value –

+---+-------+----+
| id|   name|city|
+---+-------+----+
|  1|      x|   a|
|  2|      y|   b|
|  3|   null|   c|
+---+-------+----+

Full outer join

val fullOuterJoinDF = leftDF.join(rightDF, leftDF("id") === rightDF("id"), "full_outer")

fullOuterJoinDF.show()

This code will output the following:

+---+-------+---+-----+
| id|   name| id|state|
+---+-------+---+-----+
|  1|      x|  1|    a|
|  2|      y|  2|    b|
|  3|      z|  3|    c|
+---+-------+---+-----+

Now, for example, suppose we have these new DataFrames:

val df1 = Seq((1, "x"), (2, "y"),(4,"z")).toDF("id", "name")

val df2 = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "city")

val df3 = df1.join(df2, df1("id") === df2("id"), "full_outer")
df3.show()

This code will output all rows from both sides, with nulls for the missing matches on each side (row order may vary):

+----+----+----+----+
|  id|name|  id|city|
+----+----+----+----+
|   1|   x|   1|   a|
|   2|   y|   2|   b|
|null|null|   3|   c|
|   4|   z|null|null|
+----+----+----+----+
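Because the join above uses a column expression (df1("id") === df2("id")), Spark keeps both id columns, each null on its unmatched side. Joining with the column name instead makes Spark coalesce the two keys into a single id column. A sketch with the same data (`FullOuterByName` and `local[*]` are illustrative):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("FullOuterByName").master("local[*]").getOrCreate()
import spark.implicits._

val df1 = Seq((1, "x"), (2, "y"), (4, "z")).toDF("id", "name")
val df2 = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "city")

// With Seq("id") the key survives as one column: ids 1, 2, 3 and 4 each
// appear once, with nulls only in name/city.
val merged = df1.join(df2, Seq("id"), "full_outer")
merged.show()
```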

Cross join

val crossJoinDF = leftDF.crossJoin(rightDF)

crossJoinDF.show()

This code will output the following:

+---+-------+---+-----+
| id|   name| id|state|
+---+-------+---+-----+
|  1|      x|  1|    a|
|  1|      x|  2|    b|
|  1|      x|  3|    c|
|  2|      y|  1|    a|
|  2|      y|  2|    b|
|  2|      y|  3|    c|
|  3|      z|  1|    a|
|  3|      z|  2|    b|
|  3|      z|  3|    c|
+---+-------+---+-----+

Now, for example, suppose we have these new DataFrames:

val df1 = Seq((1, "x"), (2, "y"), (3, "z")).toDF("id", "name")

val df2 = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "city")

val df3 = df1.crossJoin(df2)

df3.show()

This code will output the following:

+---+-------+---+----+
| id|   name| id|city|
+---+-------+---+----+
|  1|      x|  1|   a|
|  1|      x|  2|   b|
|  1|      x|  3|   c|
|  2|      y|  1|   a|
|  2|      y|  2|   b|
|  2|      y|  3|   c|
|  3|      z|  1|   a|
|  3|      z|  2|   b|
|  3|      z|  3|   c|
+---+-------+---+----+
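One quick sanity check on a cross join: its size is always the product of the two input sizes, which is easy to confirm on these small frames. A self-contained sketch (`CrossJoinCount` and `local[*]` are illustrative):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("CrossJoinCount").master("local[*]").getOrCreate()
import spark.implicits._

val df1 = Seq((1, "x"), (2, "y"), (3, "z")).toDF("id", "name")
val df2 = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "city")

val crossed = df1.crossJoin(df2)

// 3 rows x 3 rows = 9 combinations, every pairing of the two frames.
assert(crossed.count() == df1.count() * df2.count())
```

Because the result grows multiplicatively, cross joins on large tables deserve this kind of size check before running them for real.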

Conclusion

This blog takes a simple, learning-oriented approach; these joins come up in plenty of practical situations. For more clarification, visit SparkByExamples. We will write more blogs on this topic in the future and look at how to use Spark SQL for more complicated queries.

If you want to add anything, or you disagree with my view on any point, drop me a comment. I will be happy to discuss it.

Written by 

Rituraj Khare is a Software Consultant at Knoldus Software LLP. An avid Scala programmer and Big Data engineer, he has experience with technologies such as Scala, Spark, Kafka, Python, unit testing, Git, Jenkins, and Grafana.
