Some days back when I was doing union of 2 pair rdds, I found the strange behavior for the number of partitions.
The output RDD got different number of partition than input Rdd. For ex: suppose rdd1 and rdd2, each have 2 no of partitions and after union of these rdds I was expecting same no of partitions for output RDD, but the output RDD got the no of partitions as the sum of the partitions of input rdds.
After doing some research, I found that partitioner has an impact over this strange behavior. More details on partitioner
I found 2 cases which are decribbed below in which the above strange behavior can happen
Case 1 :
If partitioner is different for both input rdds, then the output rdd will have partitions as sum of the partitions of input rdd. To check partitioner, do the following :
Here different partitioner have 2 meaning :
- Either both input rdd have different partitioner i.e. Rdd1 has Hash partition and Rdd2 has Range partitioner.
- OR one rdd has partitioner and other one dont.
You can see in above example, rdd1 has HashPartitioner with 2 partitions while rdd2 does not have any partitioner with 4 no of partitions. After the union of these 2, the output rdd3 has 6 (2 + 4) partitions.
Case 2 :
If partitioner is same for both input rdds but no of partitions are different then also output rdd will get the partitions as sum of the partitions of input rdds.
As you can see in above example. Both input rdd, have same partitioner but with different no of partitions i..e rdd1 has 2 while rdd2 hase 3. So after union, output rdd3 has 5 (2 + 3) partitions.
Now, you can see below that what happens when both partitioner and no of partitions are same for both input rdds.
In above example, output rdd3 has same no of partitions as input rdd.
That’s it !!!
Hope you enjoy the reading and get some knowledge for you.