
When working in Spark, key/value pair RDDs come up naturally. This blog demonstrates how to work with pair RDDs in Apache Spark.


If you want to know more about basic RDDs, you can read another blog that covers the fundamentals of RDDs.

So, assuming that you have a fair idea of what Spark is and the basics of RDDs: a pair RDD is a kind of RDD whose elements are key/value pairs. Pair RDDs are a useful building block in many programs, as they expose operations that allow you to act on each key in parallel or regroup data across the network. For example, pair RDDs have a reduceByKey() method that can aggregate data separately for each key, and a join() method that can merge two RDDs together by grouping elements with the same key.

When datasets are described in terms of key/value pairs, it is common to want to
aggregate statistics across all elements with the same key.

Transformations on Pair RDDs

All the transformations available for a standard RDD are also available for pair RDDs; the only difference is that we need to pass functions that operate on tuples rather than on individual elements. Some examples are map(), reduce(), and filter(). So what new transformations do pair RDDs provide us with? Let's try some of them:

  1. reduceByKey()
    It runs several parallel reduce operations, one for each key in the dataset, where each operation combines values that have the same key. Values are combined within each partition first, so only one output per key per partition is sent over the network. reduceByKey() requires the combined result to be of the exact same type as the values.


Note: Since datasets can have very large numbers of keys, reduceByKey() is not implemented as an action that returns a value to the user program. Instead, it returns a new RDD consisting of each key and the reduced value for that key.

```scala
val bags = sc.parallelize(List(("Rosso brunello",1),("Aldo",3),("Dune London",1),("Michael Kors",2),("Dkny",2),("Rosso brunello",3),("Dune London",1),("Michael Kors",3)))
val totals = bags.reduceByKey(_ + _)
// totals: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[2] at reduceByKey
totals.collect()
// res: Array[(String, Int)] = Array((Dune London,2), (Rosso brunello,4), (Aldo,3), (Michael Kors,5), (Dkny,2))
```

2. groupByKey()

Applying groupByKey() to a dataset of (K, V) pairs shuffles the data across the network according to the key K, producing an RDD of (K, Iterable[V]). While reduceByKey() and groupByKey() can produce the same answer, reduceByKey() works much better on a large dataset. That's because Spark knows it can combine output with a common key on each partition before shuffling the data. With groupByKey(), on the other hand, all the key/value pairs are shuffled around, which is a lot of unnecessary data to transfer over the network.
So, for per-key aggregations we should avoid groupByKey() and use reduceByKey() instead.
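To see the difference in shape without a Spark cluster, here is a minimal sketch of groupByKey()'s semantics using plain Scala collections (the bags data matches the reduceByKey() example above; in real Spark the grouping happens after a full shuffle):

```scala
// The same pairs as in the reduceByKey() example.
val bags = List(("Rosso brunello", 1), ("Aldo", 3), ("Dune London", 1),
                ("Michael Kors", 2), ("Dkny", 2), ("Rosso brunello", 3),
                ("Dune London", 1), ("Michael Kors", 3))

// groupByKey() yields (K, Iterable[V]): every value travels, nothing is pre-combined.
val grouped = bags.groupBy(_._1).map { case (k, kvs) => (k, kvs.map(_._2)) }
// grouped("Michael Kors") == List(2, 3)
```

Every individual value for a key survives the grouping, which is exactly why groupByKey() moves more data than reduceByKey().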


3. foldByKey()

The foldByKey() operation is used to aggregate values based on keys. It is quite similar to fold(): both use a zero value of the same type as the data in our RDD, plus a combination function. As with fold(), the zero value provided to foldByKey() should have no impact when combined with another element using your combination function (e.g. 0 for addition, or 1 for multiplication).
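A minimal sketch of the same semantics with plain Scala collections (the data here is made up for illustration; in Spark you would call bags.foldByKey(0)(_ + _)):

```scala
val bags = List(("Aldo", 3), ("Dkny", 2), ("Aldo", 1), ("Dkny", 4))

// foldByKey(0)(_ + _): fold each key's values, starting from the zero value 0,
// which has no impact on addition.
val folded = bags.groupBy(_._1).map { case (k, kvs) => (k, kvs.map(_._2).foldLeft(0)(_ + _)) }
// folded("Aldo") == 4, folded("Dkny") == 6
```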


4. mapValues(func)

mapValues is similar to map, except that the former is only applicable to pair RDDs, meaning RDDs of the form RDD[(A, B)]. In that case, mapValues operates on the value only (the second part of the tuple), while map operates on the entire record (the tuple of key and value). There might be cases where we are interested only in the value (and not the key). In those cases, we can use mapValues() instead of map().
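A minimal sketch with plain Scala collections (the price data is hypothetical; in Spark this would be prices.mapValues(_ / 100)):

```scala
val prices = List(("Aldo", 3000), ("Dkny", 2500), ("Dune London", 4000))

// mapValues(_ / 100): only the value is transformed, the key is left untouched.
val inHundreds = prices.map { case (k, v) => (k, v / 100) }
// inHundreds == List(("Aldo", 30), ("Dkny", 25), ("Dune London", 40))
```

Because the keys are guaranteed to be unchanged, Spark can also preserve the partitioner of the parent RDD when you use mapValues(), which map() cannot promise.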


5. aggregateByKey()

aggregateByKey() combines the values for a particular key, and the result of that combination can be of any type you specify. You have to specify how the values are combined ("added") inside one partition (executed on the same node) and how you combine the results from different partitions (which may be on different nodes). reduceByKey() is a particular case, in the sense that the result of the combination (e.g. a sum) is of the same type as the values, and the operation when combining across partitions is the same as the operation when combining values inside a partition. In the following example, the same operation is used for everything, first with reduceByKey(); then with aggregateByKey(), where 0 is the initial value, _+_ combines inside a partition, and _+_ combines between partitions.
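The two-level combination can be sketched with plain Scala collections by spelling out the partitions explicitly (the partition layout below is hypothetical; in Spark you would write rdd.aggregateByKey(0)(_ + _, _ + _)):

```scala
// Two hypothetical partitions of the same keyed data.
val partitions = List(
  List(("a", 1), ("b", 2), ("a", 3)),  // partition 0
  List(("a", 4), ("b", 5))             // partition 1
)

val zero = 0
def seqOp(acc: Int, v: Int): Int = acc + v  // combine values inside a partition
def combOp(x: Int, y: Int): Int = x + y     // combine results across partitions

// Step 1: fold each partition locally, starting from the zero value.
val perPartition = partitions.map(
  _.groupBy(_._1).map { case (k, kvs) => (k, kvs.map(_._2).foldLeft(zero)(seqOp)) })

// Step 2: merge the per-partition results.
val combined = perPartition.flatten
  .groupBy(_._1).map { case (k, accs) => (k, accs.map(_._2).reduce(combOp)) }
// combined("a") == 8, combined("b") == 7
```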


Now, imagine that you want the aggregation result to be a Set of the values — a different type from the values themselves, which are integers (whereas the sum of integers is also an integer):


Here the initial value is an empty Set. Adding an element to a set (+) is the within-partition operation, and joining two sets (++) is the between-partitions operation.
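Sketching this Set-valued aggregation the same way with plain Scala collections (hypothetical partitions again; in Spark: rdd.aggregateByKey(Set.empty[Int])(_ + _, _ ++ _)):

```scala
val partitions = List(
  List(("a", 1), ("b", 2), ("a", 3)),
  List(("a", 4), ("b", 2))
)

val zero = Set.empty[Int]                          // the empty ("void") Set
def seqOp(acc: Set[Int], v: Int) = acc + v         // + : add an element inside a partition
def combOp(x: Set[Int], y: Set[Int]) = x ++ y      // ++ : join two sets between partitions

val perPartition = partitions.map(
  _.groupBy(_._1).map { case (k, kvs) => (k, kvs.map(_._2).foldLeft(zero)(seqOp)) })
val sets = perPartition.flatten
  .groupBy(_._1).map { case (k, ss) => (k, ss.map(_._2).reduce(combOp)) }
// sets("a") == Set(1, 3, 4), sets("b") == Set(2)
```

Note that the accumulator type (Set[Int]) differs from the value type (Int), which is exactly what aggregateByKey() allows and reduceByKey() does not.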

6. join()
A query that accesses multiple rows of the same or different tables at one time is called a join query. Inner joins require a key to be present in both RDDs, whereas outer joins do not.


All keys that appear in the final result are common to rdd1 and rdd2. This is similar to the relational database operation INNER JOIN.


With leftOuterJoin() the resulting pair RDD has entries for each key in the source
RDD. The value associated with each key in the result is a tuple of the value from the source RDD and an Option (or Optional in Java) for the value from the other pair RDD.


rightOuterJoin() is almost identical to leftOuterJoin() except the key must be present in the other RDD and the tuple has an option for the source rather than the other RDD.
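The three join flavours can be sketched with plain Scala collections (the rdd1/rdd2 data is made up, and collectFirst assumes unique keys to keep the sketch short; real Spark joins emit one row per matching pair):

```scala
val rdd1 = List(("a", 1), ("b", 2), ("c", 3))        // "source" RDD
val rdd2 = List(("a", "x"), ("b", "y"), ("d", "z"))  // "other" RDD

// join(): only keys present in both sides survive.
val inner = for { (k, v) <- rdd1; (k2, w) <- rdd2 if k == k2 } yield (k, (v, w))
// inner == List(("a", (1, "x")), ("b", (2, "y")))

// leftOuterJoin(): every key of rdd1 survives; the other side becomes an Option.
val left = rdd1.map { case (k, v) => (k, (v, rdd2.collectFirst { case (`k`, w) => w })) }
// left contains ("c", (3, None))

// rightOuterJoin(): symmetric — every key of rdd2 survives, the source side is optional.
val right = rdd2.map { case (k, w) => (k, (rdd1.collectFirst { case (`k`, v) => v }, w)) }
// right contains ("d", (None, "z"))
```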


7. cogroup()
Given two RDDs sharing the same key type K, with respective value types V and W, the resulting RDD is of type [K, (Iterable[V], Iterable[W])]. If a key appears in at least one of the two RDDs, it will appear in the final result.


This is very similar to the relational database operation FULL OUTER JOIN, but instead of flattening the result to one record per matching pair, it gives you the iterable interface directly; what you do with it next is up to you!
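A minimal sketch of cogroup()'s shape with plain Scala collections (the data is hypothetical; in Spark this would simply be rdd1.cogroup(rdd2)):

```scala
val rdd1 = List(("a", 1), ("b", 2), ("c", 3))
val rdd2 = List(("a", "x"), ("d", "z"))

// Every key seen on either side appears once, paired with BOTH value lists.
val keys = (rdd1.map(_._1) ++ rdd2.map(_._1)).distinct
val cogrouped = keys.map { k =>
  (k, (rdd1.collect { case (`k`, v) => v }, rdd2.collect { case (`k`, w) => w }))
}
// e.g. ("a", (List(1), List("x"))), ("c", (List(3), List())), ("d", (List(), List("z")))
```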


8. sortByKey()
sortByKey() is part of OrderedRDDFunctions, which works on key/value pairs.
It receives key/value pairs (K, V) as input, sorts the elements by key in ascending or descending order, and generates an ordered dataset.
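The ordering behaviour can be sketched with plain Scala collections (made-up data; in Spark: bags.sortByKey() for ascending, bags.sortByKey(false) for descending):

```scala
val bags = List(("Rosso brunello", 4), ("Aldo", 3), ("Michael Kors", 5), ("Dkny", 2))

val ascending  = bags.sortBy(_._1)          // like sortByKey()
val descending = bags.sortBy(_._1).reverse  // like sortByKey(false)
// ascending.head == ("Aldo", 3)
```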


Actions on Pair RDDs

Just like transformations, all the actions that are available for the base RDD are available for pair RDDs. Some additional actions are available for pair RDDs to take advantage of the key/value nature of the data. They are:

  1. countByKey() 
    – Counts the number of elements for each key.
  2. collectAsMap()
    – Collects the result as a map to provide easy lookup.
  3. lookup(key)
    – Returns all values associated with the provided key.
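The three actions can be sketched with plain Scala collections (the bags data matches the earlier examples; in Spark these would be bags.countByKey(), bags.collectAsMap(), and bags.lookup("Dune London")):

```scala
val bags = List(("Rosso brunello", 1), ("Aldo", 3), ("Dune London", 1),
                ("Michael Kors", 2), ("Dkny", 2), ("Rosso brunello", 3),
                ("Dune London", 1), ("Michael Kors", 3))

// countByKey(): number of elements per key.
val counts = bags.groupBy(_._1).map { case (k, kvs) => (k, kvs.size) }
// counts("Dune London") == 2

// collectAsMap(): one entry per key; for duplicate keys only one value is kept.
val asMap = bags.toMap
// asMap("Michael Kors") == 3

// lookup(key): all values associated with one key.
val dunes = bags.collect { case ("Dune London", v) => v }
// dunes == List(1, 1)
```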

So, these were some of the transformations and actions on paired RDDs in Spark. I hope that would help you all! 🙂

References:

  1. Learning Spark – Holden Karau, Andy Konwinski, Patrick Wendell, and Matei Zaharia

Written by 

Ramandeep Kaur is a Software Consultant with more than 1.5 years of experience. She is a Java enthusiast and has knowledge of languages like C, C++, C#, and Scala. She is familiar with object-oriented programming paradigms and also has a great interest in relational database technologies. Her hobbies include reading novels and listening to music.
