Transformation with Examples: Spark RDDs

Reading Time: 3 minutes

Transformation is one of the RDD operations in Spark. Before moving on to it, let's first discuss what Spark and RDD actually are.

What is Spark?

Apache Spark is an open-source cluster computing framework designed for fast, large-scale data processing, including data created in real time.

Spark builds on the ideas of Hadoop MapReduce. Unlike MapReduce, which writes and reads intermediate data to and from disk, Spark was optimized to run in memory. As a result, Spark processes data far more quickly than disk-based alternatives.

What is RDD?

The fundamental abstraction of Spark is the RDD (Resilient Distributed Dataset). It is a collection of elements partitioned across the nodes of the cluster, so that we can run parallel operations on it.

RDDs can be produced in one of two ways:

  • Parallelizing an existing collection in the driver program (see the sketch after this list).
  • Referencing a dataset in any storage source that offers a Hadoop InputFormat, such as a shared filesystem, HDFS, HBase, or another external storage system.
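
As a quick illustration of the first approach, here is a minimal sketch that parallelizes a small in-memory collection (the sample data and variable names are just for illustration; sc is the SparkContext created later in this post):

val data = Seq("spark", "hadoop", "flink")   // hypothetical in-memory collection
val parallelRdd = sc.parallelize(data)       // distribute it across the cluster as an RDD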

Spark RDD Operations

RDDs support two types of operations:

  • Transformations
  • Actions

A Transformation is a function that generates a new RDD from an existing one, whereas an Action is what we perform when we want to work with the actual dataset. When an action is triggered, the result is computed and returned to the driver; unlike a transformation, it does not produce a new RDD.

Transformations with Examples

The role of a transformation in Spark is to create a new dataset from an existing one. Transformations are lazy: they are computed only when an action requires a result to be returned to the driver program.

Since transformations are inherently lazy, they are executed only when we call an action, not right away. The two most basic transformations are map() and filter().
The resulting RDD is always distinct from its parent RDD after the transformation. It can be smaller (e.g. filter(), distinct(), sample()), bigger (e.g. flatMap(), union(), cartesian()), or the same size (e.g. map()).
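
To see this laziness concretely, here is a small hedged sketch: the println inside map() runs only once the action is called (with the local master used later in this post, the output appears on the console):

val lazyRdd = sc.parallelize(Seq(1, 2, 3))
  .map { n => println(s"processing $n"); n * 2 }   // nothing happens yet: map() is lazy
val total = lazyRdd.reduce(_ + _)                  // the action triggers the map(); only now do the printlns run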

In this section, I will explain a few RDD transformations with a word count example in Scala. Before we start, let's first create an RDD by reading a text file. The text file used here is a dummy dataset; you can use any dataset.

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

// Build a local SparkSession with three worker threads
val spark: SparkSession = SparkSession.builder()
      .master("local[3]")
      .appName("SparkByExamples.com")
      .getOrCreate()

val sc = spark.sparkContext

// Each element of the RDD is one line of the text file
val rdd: RDD[String] = sc.textFile("src/main/scala/test.txt")

flatMap() Transformation

The flatMap() transformation applies a function to every record and flattens the result into a new RDD. The example below splits each record in the RDD by spaces and then flattens it, so each element of the resulting RDD contains a single word.


val rdd2 = rdd.flatMap(f=>f.split(" "))
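
For instance, if test.txt contained the two lines "spark is fast" and "spark and hadoop" (an assumed file content, purely for illustration), the flattened RDD would hold one word per element:

rdd2.collect().foreach(println)   // prints: spark, is, fast, spark, and, hadoop (one word per line)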

map() Transformation

The map() transformation is used to apply any complex operation, such as adding a column or updating a column, and the output of such a transformation always has the same number of records as the input.

In our word count example, we create a new column by assigning the value 1 to each word. The result is an RDD of key-value pairs, where the key is the word of type String and the value is 1 of type Int; such an RDD automatically gains the operations of PairRDDFunctions. I have spelled out the type of the rdd3 variable for your understanding.


val rdd3:RDD[(String,Int)]= rdd2.map(m=>(m,1))
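
Continuing with the same assumed file content, every word is now paired with the value 1:

rdd3.collect().foreach(println)   // prints: (spark,1), (is,1), (fast,1), (spark,1), (and,1), (hadoop,1)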

filter() Transformation

The records in an RDD can be filtered with the filter() transformation. In our illustration, we keep only the words that begin with "a".


val rdd4 = rdd3.filter(a=> a._1.startsWith("a"))
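
With the assumed file content above, only a single pair survives the filter:

rdd4.collect().foreach(println)   // prints: (and,1)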

reduceByKey() Transformation

The reduceByKey() transformation merges the values for each key using the supplied function. In our example, it sums the values for each word, so the output RDD contains each unique word together with its count.


val rdd5 = rdd3.reduceByKey(_ + _)
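
Again assuming the same file content, the (word, 1) pairs collapse into per-word counts:

rdd5.collect().foreach(println)   // prints, in some order: (spark,2), (is,1), (fast,1), (and,1), (hadoop,1)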

union(dataset)

The union() function gives us a new RDD containing the elements of both input RDDs. The two RDDs must be of the same type for this function to work.
For instance, if RDD1's elements are Spark, Spark, Hadoop, and Flink, and RDD2's elements are Big data, Spark, and Flink, then rdd1.union(rdd2) will contain: Spark, Spark, Hadoop, Flink, Big data, Spark, Flink (duplicates are kept).

val rdd6 = rdd5.union(rdd3)
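
The example from the prose above can be reproduced with two small parallelized RDDs (a sketch; the variable names unionRdd1 and unionRdd2 are mine, and the elements come from the example):

val unionRdd1 = sc.parallelize(Seq("Spark", "Spark", "Hadoop", "Flink"))
val unionRdd2 = sc.parallelize(Seq("Big data", "Spark", "Flink"))
unionRdd1.union(unionRdd2).collect().foreach(println)   // Spark, Spark, Hadoop, Flink, Big data, Spark, Flink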

intersection(other-dataset)

With the intersection() function, we get only the common elements of both RDDs in the new RDD. As with union(), the two RDDs must be of the same type.

val rdd7 = rdd5.intersection(rdd3)   // pairs present in both RDDs (here: the words that appear exactly once)
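
Applied to the two small RDDs from the union sketch above, intersection() keeps only the shared elements, and each common element appears once in the result:

unionRdd1.intersection(unionRdd2).collect().foreach(println)   // Spark, Flink (order may vary)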

Conclusion

In this Spark RDD transformations blog, you have learned about different transformation functions and their usage with Scala examples. In the next blog, we will learn about actions.

Happy Learning !!

Written by 

Meenakshi Goyal is a Software Consultant who started her career in an environment and organization where her skills are challenged each day, resulting in ample learning and growth opportunities. She is proficient in Scala, Akka, Akka HTTP, and Java, is passionate about implementing and launching new projects, and is able to translate business requirements into technical solutions. Her hobbies are traveling and dancing.