Apache Spark : Spark Union adds up the partition of input RDDs

Reading Time: 2 minutes

Some days back when I was doing union of 2 pair rdds, I found the strange behavior for the number of partitions.

The output RDD got different number of partition than input Rdd. For ex: suppose rdd1 and rdd2, each have 2 no of partitions and after union of these rdds I was expecting same no of partitions for output RDD, but the output RDD got the no of partitions as the sum of the partitions of input rdds.

After doing some research, I found that partitioner has an impact over this strange behavior. More details on partitioner

I found 2 cases which are decribbed below in which the above strange behavior can happen

Case 1 :

If partitioner is different for both input rdds, then the output rdd will have partitions as sum of the partitions of input rdd. To check partitioner, do the following :

1

Here different partitioner have 2 meaning :

  1. Either both input rdd have different partitioner i.e. Rdd1 has Hash partition and Rdd2 has Range partitioner.
  2. OR one rdd has partitioner and other one dont.

2

You can see in above example, rdd1 has HashPartitioner with 2 partitions while rdd2 does not have any partitioner with 4 no of partitions. After the union of these 2, the output rdd3 has 6 (2 + 4) partitions.

Case 2 :

If partitioner is same for both input rdds but no of partitions are different then also output rdd will get the partitions as sum of the partitions of input rdds.

3

As you can see in above example. Both input rdd, have same partitioner but with different no of partitions i..e rdd1 has 2 while rdd2 hase 3. So after union, output rdd3 has 5 (2 + 3) partitions.

Now, you can see below that what happens when both partitioner and no  of partitions are same for both input rdds.

4

In above example, output rdd3 has same no of partitions as input rdd.

That’s it !!!

Hope you enjoy the reading and get some knowledge for you.

Cheers !!!


KNOLDUS-advt-sticker

Written by 

Rishi is a tech enthusiast with having around 10 years of experience who loves to solve complex problems with pure quality. He is a functional programmer and loves to learn new trending technologies. His leadership skill is well prooven and has delivered multiple distributed applications with high scalability and availability by keeping the Reactive principles in mind. He is well versed with Scala, Akka, Akka HTTP, Akka Streams, Java8, Reactive principles, Microservice architecture, Async programming, functional programming, distributed systems, AWS, docker.