Apache Spark: Spark union adds up the partitions of input RDDs


A few days ago, while taking the union of two pair RDDs, I noticed some strange behavior in the number of partitions.

The output RDD had a different number of partitions than the input RDDs. For example, suppose rdd1 and rdd2 each have 2 partitions. After the union of these RDDs I expected the output RDD to have the same number of partitions, but instead it had the sum of the partitions of the input RDDs.

After doing some research, I found that the partitioner is what drives this behavior.

I found 2 cases, described below, in which this behavior occurs.

Case 1 :

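If the partitioner is different for the two input RDDs, the output RDD will have as many partitions as the sum of the partitions of the input RDDs. To check an RDD's partitioner, inspect its `partitioner` field, which is an `Option[Partitioner]` (it is `None` when the RDD has no partitioner).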
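As a minimal sketch of that check, assuming a local Spark setup: the app name, the `local[2]` master and the sample data are illustrative choices, not taken from the original post.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

// Local context for experimentation; getOrCreate reuses an existing
// context (for example the one provided by spark-shell).
val conf = new SparkConf().setAppName("partitioner-check").setMaster("local[2]")
val sc = SparkContext.getOrCreate(conf)

// A pair RDD built with parallelize carries no partitioner.
val plain = sc.parallelize(Seq((1, "a"), (2, "b"), (3, "c"), (4, "d")), 4)
println(plain.partitioner)            // None

// partitionBy attaches the given partitioner to the resulting RDD.
val hashed = plain.partitionBy(new HashPartitioner(2))
println(hashed.partitioner.isDefined) // true
println(hashed.partitions.length)     // 2
```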

Here, "different partitioner" can mean one of two things:

  1. The input RDDs have different partitioners, e.g. rdd1 has a HashPartitioner and rdd2 has a RangePartitioner.
  2. Or one RDD has a partitioner and the other does not.

In the example for this case, rdd1 has a HashPartitioner with 2 partitions, while rdd2 has no partitioner and 4 partitions. After the union of these two, the output rdd3 has 6 (2 + 4) partitions.
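The Case 1 scenario described above can be reproduced with a sketch along these lines. The sample data, app name and local master are assumptions made for illustration; only the partition counts (2 and 4) come from the text.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

val conf = new SparkConf().setAppName("union-case-1").setMaster("local[2]")
val sc = SparkContext.getOrCreate(conf)

// rdd1: hash-partitioned into 2 partitions.
val rdd1 = sc.parallelize(Seq((1, "a"), (2, "b")), 4)
  .partitionBy(new HashPartitioner(2))

// rdd2: 4 partitions, no partitioner.
val rdd2 = sc.parallelize(Seq((3, "c"), (4, "d")), 4)

val rdd3 = rdd1.union(rdd2)

println(rdd1.partitioner.isDefined) // true
println(rdd2.partitioner)           // None
println(rdd3.partitions.length)     // 6
```

Since the partitioners do not match, the result carries no partitioner of its own, so `rdd3.partitioner` is `None`.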

Case 2 :

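If both input RDDs use the same type of partitioner but with different numbers of partitions, the output RDD again gets the sum of the partitions of the input RDDs. Spark treats these as different partitioners: for example, two HashPartitioners are equal only if their numbers of partitions match.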

In the example for this case, both input RDDs have the same type of partitioner but different numbers of partitions, i.e. rdd1 has 2 while rdd2 has 3. So after the union, the output rdd3 has 5 (2 + 3) partitions.
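A hedged sketch of Case 2, assuming both RDDs use a HashPartitioner (the text does not say which partitioner type was used) and with illustrative sample data:

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

val conf = new SparkConf().setAppName("union-case-2").setMaster("local[2]")
val sc = SparkContext.getOrCreate(conf)

// Same partitioner type, different numbers of partitions.
val rdd1 = sc.parallelize(Seq((1, "a"), (2, "b")), 4)
  .partitionBy(new HashPartitioner(2))
val rdd2 = sc.parallelize(Seq((3, "c"), (4, "d")), 4)
  .partitionBy(new HashPartitioner(3))

// The partitioners compare as unequal because their partition counts differ.
println(rdd1.partitioner == rdd2.partitioner) // false

val rdd3 = rdd1.union(rdd2)
println(rdd3.partitions.length)               // 5
```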

Now consider what happens when both the partitioner and the number of partitions are the same for both input RDDs. In that case, the output rdd3 has the same number of partitions as each input RDD.
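The matching-partitioner situation can be sketched as follows, again assuming a HashPartitioner with 2 partitions on both sides and made-up sample data:

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

val conf = new SparkConf().setAppName("union-same-partitioner").setMaster("local[2]")
val sc = SparkContext.getOrCreate(conf)

// Both inputs use an equal partitioner: same type, same number of partitions.
val rdd1 = sc.parallelize(Seq((1, "a"), (2, "b")), 4)
  .partitionBy(new HashPartitioner(2))
val rdd2 = sc.parallelize(Seq((3, "c"), (4, "d")), 4)
  .partitionBy(new HashPartitioner(2))

val rdd3 = rdd1.union(rdd2)

println(rdd3.partitions.length)               // 2
println(rdd3.partitioner == rdd1.partitioner) // true
println(rdd3.count())                         // 4
```

Here the union keeps the shared partitioner, so the output has 2 partitions instead of 4, while still containing every record from both inputs.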

That’s it !!!

I hope you enjoyed reading this and learned something from it.

Cheers !!!
 


About Rishi Khandelwal

Sr. Software Consultant with more than 6 years of industry experience. He has worked with various technologies such as Scala, Java, Play, Akka, Spark, Hive, Cassandra, Akka-http, ElasticSearch, Backbone.js, html5, javascript, Less, Amazon EC2, WebRTC, SBT.