Different Types of JOIN in Spark SQL

Reading Time: 3 minutes

Join in Spark SQL is the functionality to join two or more datasets that are similar to the table join in SQL based databases. Spark works as the tabular form of datasets and data frames. The Spark SQL supports several types of joins such as inner join, cross join, left outer join, right outer join, full outer join, left semi-join, left anti join. Joins scenarios are implemented in Spark SQL based upon the business use case. Some of the joins require high resource and computation efficiency. For managing such scenarios spark support SQL optimizer and cross join enabler flags features.

Types of Join in Spark SQL

Following are the different types of Joins:

  • INNER JOIN
  • LEFT OUTER JOIN
  • RIGHT OUTER JOIN
  • FULL OUTER JOIN
  • LEFT SEMI JOIN
  • LEFT ANTI JOIN

Example of Data Creation

We will use the following data to demonstrate the different types of joins:

Book Dataset:

case class Book(book_name: String, cost: Int, writer_id:Int)
val bookDS = Seq(
Book("Scala", 400, 1),
Book("Spark", 500, 2),
Book("Kafka", 300, 3),
Book("Java", 350, 5)
).toDS()
bookDS.show()

Book Dataset Joins in Spark SQL

Writer Dataset:

case class Writer(writer_name: String, writer_id:Int)
val writerDS = Seq(
Writer("Martin",1),
Writer("Zaharia " 2),
Writer("Neha", 3),
Writer("James", 4)
).toDS()
writerDS.show()

Writer Dataset joins in spark SQL

Types of Joins

Below are mentioned 7 different types of Joins:

1. INNER JOIN

The INNER JOIN returns the dataset which has the rows that have matching values in both the datasets i.e. value of the common field will be the same.

val BookWriterInner = bookDS.join(writerDS, bookDS("writer_id") === writerDS("writer_id"), "inner")
BookWriterInner.show()

INNER JOIN Joins in Spark SQL

2. LEFT OUTER JOIN

The LEFT OUTER JOIN returns the dataset that has all rows from the left dataset, and the matched rows from the right dataset.

val BookWriterLeft = bookDS.join(writerDS, bookDS("writer_id") === writerDS("writer_id"), "leftouter")
BookWriterLeft.show()

LEFT OUTER JOIN Joins in Spark SQL

3. RIGHT OUTER JOIN

The RIGHT OUTER JOIN returns the dataset that has all rows from the right dataset, and the matched rows from the left dataset.

val BookWriterRight = bookDS.join(writerDS, bookDS("writer_id") === writerDS("writer_id"), "rightouter")
BookWriterRight.show()

RIGHT OUTER JOIN Joins in Spark SQL

4. FULL OUTER JOIN

The FULL OUTER JOIN returns the dataset that has all rows when there is a match in either the left or right dataset.

val BookWriterFull = bookDS.join(writerDS, bookDS("writer_id") === writerDS("writer_id"), "fullouter")
BookWriterFull.show()

FULL OUTER JOIN

5. LEFT SEMI JOIN

The LEFT SEMI JOIN returns the dataset which has all rows from the left dataset having their correspondence in the right dataset. Unlike the LEFT OUTER JOIN, the returned dataset in LEFT SEMI JOIN contains only the columns from the left dataset.

val BookWriterLeftSemi = bookDS.join(writerDS, bookDS("writer_id") === writerDS("writer_id"), "leftsemi")
BookWriterLeftSemi.show()

LEFT SEMI JOIN

6. LEFT ANTI JOIN

The ANTI SEMI JOIN returns the dataset which has all the rows from the left dataset that don’t have their matching in the right dataset. It also contains only the columns from the left dataset.

val BookWriterLeftAnti = bookDS.join(writerDS, bookDS("writer_id") === writerDS("writer_id"), "leftanti")
BookWriterLeftAnti.show()

LEFT ANTI JOIN

Conclusion

Joining data is one of the most common and important operations for fulfilling our business use case. Spark SQL supports all the fundamental types of joins. While joining, we need to also consider performance as they may require large network transfers or even create datasets beyond our capability to handle.

Keep Learning 🙂

Written by 

Meenakshi Goyal is a Software Consultant and started her career in an environment and organization where her skills are challenged each day, resulting in ample learning and growth opportunities. Proficient in Scala, Akka, Akka HTTP , JAVA. Passionate about implementing and launching new projects. Ability to translate business requirements into technical solutions. Her hobbies are traveling and dancing.

Discover more from Knoldus Blogs

Subscribe now to keep reading and get access to the full archive.

Continue reading