Controlling RDD Partitions in Apache Spark


In this blog, we will discuss what RDD partitioning is, why partitioning is important, and how to create and use Spark partitioners to minimize shuffle operations across the nodes of a distributed Spark application.

What is Partitioning?

Partitioning is a transformation operation available on all key-value pair RDDs in Apache Spark. It is required when we want to group values on the basis of the similarity of their keys, where the similarity of keys can be defined by a function.

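For illustration, here is a minimal sketch of such a function, written as a custom Partitioner. The class name FirstCharPartitioner is a hypothetical one of our own, not part of Spark:

import org.apache.spark.Partitioner

// A hypothetical custom Partitioner: keys that are "similar" by their
// first character end up in the same partition.
class FirstCharPartitioner(override val numPartitions: Int) extends Partitioner {
  override def getPartition(key: Any): Int =
    math.abs(key.toString.headOption.getOrElse(' ').hashCode) % numPartitions
}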

Why is it Important?

Partitioning is of great importance when working with key-value pair RDDs. For example, aggregating the values for certain keys in a distributed RDD would require fetching values from other nodes before computing the final aggregate result per key. That is what we call a “shuffle operation” in Spark. Keeping a group of elements with similar keys on the same node reduces this communication cost.

How to apply a Partitioner on a pair RDD

The transformation partitionBy() is used to partition a pair RDD, taking a Partitioner object that specifies the number of partitions.

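A minimal sketch of the call, assuming pairRdd is an existing key-value pair RDD:

import org.apache.spark.HashPartitioner

// partitionBy takes a Partitioner object; here a HashPartitioner
// with the desired number of partitions (4 is an arbitrary choice).
val partitioned = pairRdd.partitionBy(new HashPartitioner(4))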

Let us create a pair RDD of (Int, String) and partition it with a HashPartitioner of 3 partitions. The HashPartitioner puts keys with the same hash value into the same partition, according to the function:

partition = key.hashCode % numPartitions   (Spark takes the non-negative modulus)

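A sketch of this example, with hypothetical data and sc as the SparkContext (available by default in the spark-shell):

import org.apache.spark.HashPartitioner

// Hypothetical data: a pair RDD of (Int, String) with five elements.
val pairRdd = sc.parallelize(Seq(
  (1, "one"), (2, "two"), (3, "three"), (4, "four"), (5, "five")
))

// Partition it with a HashPartitioner of 3 partitions.
val partitioned = pairRdd.partitionBy(new HashPartitioner(3))

partitioned.partitioner // Some(org.apache.spark.HashPartitioner@...)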

Let us see the result of the partitioning by counting the values in each partition. Here the HashPartitioner has 3 partitions and the RDD holds 5 values in total.

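The exact counts depend on the keys; one way to compute them is this sketch, using mapPartitionsWithIndex() on the hypothetical RDD above:

// Count how many values landed in each partition.
val counts = partitioned
  .mapPartitionsWithIndex((index, iter) => Iterator((index, iter.size)))
  .collect()

// For the example keys 1 to 5, key.hashCode % 3 gives:
//   partition 0 -> key 3       (1 value)
//   partition 1 -> keys 1, 4   (2 values)
//   partition 2 -> keys 2, 5   (2 values)
// counts: Array((0,1), (1,2), (2,2))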

When do we get the benefits of Partitioning?

When an RDD has already been partitioned by a previous transformation with the same Partitioner, the shuffle is avoided on at least one RDD, which reduces the communication cost. Following is a list of some of the transformations that benefit from pre-partitioned RDDs (a sketch of join() on two pre-partitioned RDDs follows the list).

  • join()
  • cogroup()
  • groupWith()
  • leftOuterJoin()
  • rightOuterJoin()
  • groupByKey()
  • reduceByKey()
  • combineByKey()
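For example, here is a sketch of join() on two hypothetical RDDs that share a partitioner:

import org.apache.spark.HashPartitioner

val partitioner = new HashPartitioner(3)

// Hypothetical RDDs, both pre-partitioned with the same partitioner.
val users  = sc.parallelize(Seq((1, "alice"), (2, "bob"))).partitionBy(partitioner)
val orders = sc.parallelize(Seq((1, "book"), (2, "pen"))).partitionBy(partitioner)

// Matching keys are already co-located on the same nodes, so this
// join avoids a full shuffle.
val joined = users.join(orders) // RDD[(Int, (String, String))]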

When is Partition Information not Preserved?

Not all transformations preserve the partition information. For example, when calling map() on a hash-partitioned key-value RDD, it is not guaranteed that the partition information will be available in the resultant RDD, since the function passed to map() can change the keys of the elements, which would make the existing partitioning inconsistent.
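A small sketch of this behavior, reusing the partitioned RDD from the earlier example:

// map() may change the keys, so Spark drops the partitioner.
val mapped = partitioned.map { case (k, v) => (k, v.length) }

partitioned.partitioner // Some(org.apache.spark.HashPartitioner@...)
mapped.partitioner      // None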

How to preserve Partition Information?

Using the transformations mapValues() and flatMapValues() instead of map() and flatMap() on a pair RDD transforms only the values, keeping the keys intact. Since these transformations pass only the values to your function, the keys cannot change, and the partitioner is preserved.
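Continuing the same sketch:

// mapValues() touches only the values, so the keys (and hence the
// partitioning) are guaranteed to stay intact.
val mappedValues = partitioned.mapValues(_.length)

mappedValues.partitioner // Some(org.apache.spark.HashPartitioner@...) -- preserved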


About Manish Mishra

Manish is a Scala Developer at Knoldus Software LLP. He loves to learn and share about Functional Programming, Scala, Akka, and Spark.