Controlling RDD Partitions in Apache Spark

Reading Time: 2 minutes

In this blog, we will discuss What is RDD partitioning, why Partitioning is important and how to create and use spark Partitioners to minimize the shuffle operations across the nodes in a distributed Spark application.

What is Partitioning?

Partitioning is a transformation operation which is available on all key value pair RDDs  in Apache Spark. It is required when we try to group values on the basis of similarity of their keys. The similarity of keys can be defined by a function.

Partitioning

Why is it Important?

Partitioning has great importance when working with key value pair RDDs. For example aggregating values for certain keys in a distributed RDD element would required fetching values from other nodes before computing final aggregate result per key. That is what we call a “Shuffle Operation” in spark. Locality of a group of elements with similar values on the same node reduces communication costs.

How to apply a Partitioner on pair RDD

Transformation .partitionBy is used to specify a partition on a pair RDD specifying numOfPartitions in the Partitioner object.

Partitioning

Creating a pair RDD of [Int,String] and partitioning it by HashPartitioner with 3 partitions. The HashPartitioner groups similar values in same partition according to the function.

Partitioning

Let us see the result of partition by looking into count of values into each partition. Here number of partitions in HashPartitioner is 3 and total values in RDD is 5. After counting the values into each partition we get:

part6

When we get benefits of Partitioning

When an RDD is partitioned by the previous transformation with the same Partitioner, the shuffle will be avoided on at least one RDD and will reduce communication cost. Following is the list of some of the transformations which will benefit from pre-partitioned RDDs.

  • join()
  • cogroup()
  • groupWith()
  • leftOuterJoin()
  • rightOuterJoin()
  • groupByKey()
  • reduceByKey()
  • combineByKey()

When a Partition information is not preserved?

Not all transformation preserves the partition information. Calling map() on a hash-Partitioned key-value RDD, it is not guaranteed that the partition information will available in resultant RDD, since the function passed in a map transformation can change the key of elements in the RDD and that would be inconsistent.

How to preserve Partition Information?

Using transformations mapValues() and flatMapValues() instead of map() and flatMap() on a pair RDD will only transform the values of RDDs keeping keys intact. These transformations only pass values to your function definition.

Written by 

Manish Mishra is Lead Software Consultant, with experience of more than 7 years. His primary development technology was Java. He fell for Scala language and found it innovative and interesting language and fun to code with. He has also co-authored a journal paper titled: Economy Driven Real Time Deadline Based Scheduling. His interests include: learning cloud computing products and technologies, algorithm designing. He finds Books and literature as favorite companions in solitude. He likes stories, Spiritual Fictions and Time Traveling fictions as his favorites.

1 thought on “Controlling RDD Partitions in Apache Spark2 min read

Comments are closed.

Discover more from Knoldus Blogs

Subscribe now to keep reading and get access to the full archive.

Continue reading