When working in Spark, key/value pair RDDs come up naturally. This blog demonstrates working with pair RDDs in Apache Spark.
If you want to know more about basic RDDs, you can read our earlier blog covering the fundamentals of RDDs.
So, assuming that you have a fair idea of what Spark is and the basics of RDDs, let's move on. A pair RDD is a kind of RDD whose elements are key/value pairs. Pair RDDs are a useful building block in many programs, as they expose operations that allow you to act on each key in parallel or regroup data across the network. For example, pair RDDs have a reduceByKey() method that can aggregate data separately for each key, and a join() method that can merge two RDDs together by grouping elements with the same key.
When datasets are described in terms of key/value pairs, it is common to want to aggregate statistics across all elements with the same key.
Transformations on Pair RDDs
Whatever transformations are available for a standard RDD are available for pair RDDs too; the only difference is that we need to pass functions that operate on tuples rather than on individual elements. Examples include map(), filter(), and reduce(). So what new transformations do pair RDDs provide? Let's try some of them:
reduceByKey() runs several parallel reduce operations, one for each key in the dataset, where each operation combines values that have the same key. Data are combined within each partition, so only one output per key per partition is sent over the network. reduceByKey() requires that combining two values produces a result of exactly the same type as the values themselves.
Note: Since datasets can have very large numbers of keys, reduceByKey() is not implemented as an action that returns a value to the user program. Instead, it returns a new RDD consisting of each key and the reduced value for that key.
```scala
val bags = sc.parallelize(List(("Rosso brunello", 1), ("Aldo", 3), ("Dune London", 1), ("Michael Kors", 2), ("Dkny", 2), ("Rosso brunello", 3), ("Dune London", 1), ("Michael Kors", 3)))
val totals = bags.reduceByKey(_ + _)
// totals: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD at reduceByKey
totals.collect
// res: Array[(String, Int)] = Array((Dune London,2), (Rosso brunello,4), (Aldo,3), (Michael Kors,5), (Dkny,2))
```
On applying groupByKey() to a dataset of (K, V) pairs, the data are shuffled according to the key K into another RDD, and this transformation moves a lot of unnecessary data over the network. While reduceByKey() and groupByKey() can produce the same answer, reduceByKey() works much better on a large dataset. That's because Spark knows it can combine output with a common key on each partition before shuffling the data. When calling groupByKey(), by contrast, all the key/value pairs are shuffled around, which is a lot of unnecessary data transferred over the network.
So, we should avoid groupByKey() and rather use reduceByKey().
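A minimal sketch of the difference, using a made-up brand dataset and a local SparkContext (spark-core is assumed to be on the classpath):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Local context for demonstration only.
val sc = SparkContext.getOrCreate(new SparkConf().setAppName("pair-rdd-demo").setMaster("local[2]"))

// Hypothetical sales counts, made up for illustration.
val sales = sc.parallelize(List(("Aldo", 3), ("Dkny", 2), ("Aldo", 1), ("Dkny", 4)))

// groupByKey() shuffles every (key, value) pair; values arrive as an Iterable per key.
val grouped = sales.groupByKey().mapValues(_.sum).collect().toMap

// reduceByKey() pre-combines values inside each partition before shuffling.
val reduced = sales.reduceByKey(_ + _).collect().toMap
// Both give Map(Aldo -> 4, Dkny -> 6); reduceByKey() simply ships less data.
```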
foldByKey() is used to aggregate values based on keys. It is quite similar to fold(): both use a zero value of the same type as the data in our RDD, plus a combination function. As with fold(), the zero value provided for foldByKey() should have no impact when combined with another element using your combination function.
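A quick sketch, assuming a local SparkContext and made-up numbers; note how the zero value 0 leaves the sums unchanged:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = SparkContext.getOrCreate(new SparkConf().setAppName("pair-rdd-demo").setMaster("local[2]"))

val scores = sc.parallelize(List(("a", 1), ("b", 2), ("a", 3)))

// 0 is the zero value: folding it into a sum has no effect on the result.
val folded = scores.foldByKey(0)(_ + _).collect().toMap
// Map(a -> 4, b -> 2)
```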
mapValues() is similar to map(), except that it is only applicable to pair RDDs, i.e. RDDs of the form RDD[(A, B)]. In that case, mapValues() operates on the value only (the second part of the tuple), while map() operates on the entire record (the tuple of key and value). There might be cases where we are interested only in the value (and not the key); in those cases, we can use mapValues() instead of map().
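For example (a sketch with an invented stock dataset and a local SparkContext):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = SparkContext.getOrCreate(new SparkConf().setAppName("pair-rdd-demo").setMaster("local[2]"))

// Hypothetical stock counts, made up for illustration.
val stock = sc.parallelize(List(("bag", 100), ("shoe", 80)))

// mapValues touches only the value; keys (and any partitioner) are preserved.
val doubled = stock.mapValues(_ * 2).collect().toMap
// Map(bag -> 200, shoe -> 160)
```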
aggregateByKey() will combine the values for a particular key, and the result of such a combination can be any type you specify. You have to specify how the values are combined ("added") inside one partition (which is executed on the same node) and how the results from different partitions (which may be on different nodes) are combined. reduceByKey() is a particular case, in the sense that the result of the combination (e.g. a sum) is of the same type as the values, and the operation that combines results from different partitions is the same as the operation that combines values inside a partition. For example, to reproduce reduceByKey(_ + _) with aggregateByKey(), 0 is the initial value, _ + _ combines values inside a partition, and _ + _ combines results between partitions.
Now, imagine that you want the aggregation to be a Set of the values, that is, a type different from that of the values, which are integers (the sum of integers is also an integer). In that case, the initial value is an empty Set, adding an element to the set is done with +, and joining two sets is done with ++.
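Both cases can be sketched as follows (local SparkContext and sample data assumed):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = SparkContext.getOrCreate(new SparkConf().setAppName("pair-rdd-demo").setMaster("local[2]"))

val pairs = sc.parallelize(List(("a", 3), ("b", 5), ("a", 3), ("b", 7)))

// Same result as reduceByKey(_ + _): zero value 0, + within and between partitions.
val sums = pairs.aggregateByKey(0)(_ + _, _ + _).collect().toMap

// Result type (Set[Int]) differs from the value type (Int):
// start from an empty Set, add elements with +, merge partition results with ++.
val sets = pairs.aggregateByKey(Set.empty[Int])(_ + _, _ ++ _).collect().toMap
// Map(a -> Set(3), b -> Set(5, 7))
```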
A query that accesses multiple rows of the same or different tables at one time is called a join query. Inner joins require a key to be present in both RDDs, whereas outer joins do not.
With join(), all keys that appear in the final result are common to rdd1 and rdd2. This is similar to the relational database operation INNER JOIN.
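A small sketch (the brand data is invented; a local SparkContext is assumed):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = SparkContext.getOrCreate(new SparkConf().setAppName("pair-rdd-demo").setMaster("local[2]"))

val rdd1 = sc.parallelize(List(("Aldo", 3), ("Dkny", 2), ("Guess", 4)))
val rdd2 = sc.parallelize(List(("Aldo", "shoes"), ("Dkny", "bags")))

// Only keys present in BOTH RDDs survive the inner join; "Guess" is dropped.
val joined = rdd1.join(rdd2).collect().toMap
// Map(Aldo -> (3,shoes), Dkny -> (2,bags))
```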
With leftOuterJoin(), the resulting pair RDD has entries for each key in the source RDD. The value associated with each key in the result is a tuple of the value from the source RDD and an Option (or Optional in Java) for the value from the other pair RDD.
rightOuterJoin() is almost identical to leftOuterJoin(), except that the key must be present in the other RDD, and the tuple has an Option for the value from the source RDD rather than the other RDD.
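The two outer joins side by side (sample data invented, local SparkContext assumed):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = SparkContext.getOrCreate(new SparkConf().setAppName("pair-rdd-demo").setMaster("local[2]"))

val source = sc.parallelize(List(("a", 1), ("b", 2)))
val other  = sc.parallelize(List(("a", "x"), ("c", "y")))

// Every key of `source` appears; the missing side becomes None.
val left = source.leftOuterJoin(other).collect().toMap
// Map(a -> (1,Some(x)), b -> (2,None))

// Every key of `other` appears; now the source side is the Option.
val right = source.rightOuterJoin(other).collect().toMap
// Map(a -> (Some(1),x), c -> (None,y))
```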
cogroup(): given two RDDs sharing the same key type K, with respective value types V and W, the resulting RDD is of type RDD[(K, (Iterable[V], Iterable[W]))]. If a key appears in at least one of the two RDDs, it will appear in the final result.
This is very similar to the relational database operation FULL OUTER JOIN, but instead of flattening the result into one record per matching pair, it gives you the Iterable interface directly; what you do with it next is up to you.
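As a sketch (sample data invented, local SparkContext assumed; the Iterables are converted to Sets only to make the result easy to compare):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = SparkContext.getOrCreate(new SparkConf().setAppName("pair-rdd-demo").setMaster("local[2]"))

val v = sc.parallelize(List(("a", 1), ("a", 2), ("b", 3)))
val w = sc.parallelize(List(("a", "x"), ("c", "y")))

// Every key from either RDD appears exactly once, paired with two Iterables.
val grouped = v.cogroup(w).mapValues { case (vs, ws) => (vs.toSet, ws.toSet) }.collect().toMap
// "b" and "c" still appear, each with one empty side.
```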
sortByKey() is part of OrderedRDDFunctions, which works on key/value pairs. It takes key/value pairs (K, V) as input, sorts the elements by key in ascending or descending order, and returns a dataset in that order.
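For example (sample data invented, local SparkContext assumed):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = SparkContext.getOrCreate(new SparkConf().setAppName("pair-rdd-demo").setMaster("local[2]"))

val unsorted = sc.parallelize(List(("b", 2), ("a", 1), ("c", 3)))

// Ascending by default; pass false for descending order.
val ascendingKeys  = unsorted.sortByKey().keys.collect().toList      // List(a, b, c)
val descendingKeys = unsorted.sortByKey(false).keys.collect().toList // List(c, b, a)
```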
Actions on Pair RDDs
Just like transformations, all the actions that are available for the base RDD are available for pair RDDs. Some additional actions are available for pair RDDs that take advantage of the key/value nature of the data. They are:
countByKey() – Counts the number of elements for each key.
collectAsMap() – Collects the result as a map to provide easy lookup.
lookup(key) – Returns all values associated with the provided key.
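The three actions in one sketch (sample data invented, local SparkContext assumed):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = SparkContext.getOrCreate(new SparkConf().setAppName("pair-rdd-demo").setMaster("local[2]"))

val bags = sc.parallelize(List(("Aldo", 3), ("Dkny", 2), ("Aldo", 1)))

val counts = bags.countByKey()   // Map(Aldo -> 2, Dkny -> 1): elements per key
val asMap  = bags.collectAsMap() // beware: keeps only ONE value per duplicate key
val aldo   = bags.lookup("Aldo") // all values stored under "Aldo"
```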
So, these were some of the transformations and actions on paired RDDs in Spark. I hope that would help you all! 🙂
- Learning Spark – Holden Karau, Andy Konwinski, Patrick Wendell, and Matei Zaharia