Transformation is one of the RDD operations in Spark. Before we look at transformations, let's first discuss what Spark and an RDD actually are.
What is Spark?
Apache Spark is an open-source cluster computing framework designed for fast, large-scale data processing, including data created in real time.
Spark was developed to address the limitations of Hadoop MapReduce. Unlike MapReduce, which writes and reads intermediate data to and from disk, Spark was optimized to keep data in memory. As a result, Spark processes data far more quickly.
What is RDD?
The fundamental abstraction of Spark is the RDD (Resilient Distributed Dataset). It is a collection of elements partitioned across the cluster nodes so that we can run parallel operations on it.
RDDs can be produced in one of two ways:
- Parallelizing an existing collection in the driver program.
- Referencing a dataset in any external storage system that offers a Hadoop InputFormat, such as a shared filesystem, HDFS, or HBase.
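A minimal sketch of both creation paths (the application name and file path here are illustrative, not from the word count example below):

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local")
  .appName("rdd-creation")
  .getOrCreate()
val sc = spark.sparkContext

// 1. Parallelize an existing collection in the driver program
val fromCollection: RDD[Int] = sc.parallelize(Seq(1, 2, 3, 4, 5))

// 2. Reference a dataset in external storage (a local text file here;
//    HDFS or other Hadoop-compatible paths work the same way)
val fromFile: RDD[String] = sc.textFile("data/input.txt") // illustrative path
```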
Spark RDD Operations
An RDD supports two types of operations: transformations and actions.

A Transformation is a function that produces a new RDD from an existing one. An Action is what we call when we want to work with the actual dataset: it triggers the computation and returns a result to the driver program (or writes it to storage), and unlike a transformation it does not produce a new RDD.
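A small sketch of the difference (the variable names and data here are illustrative):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local")
  .appName("lazy-demo")
  .getOrCreate()
val sc = spark.sparkContext

val nums = sc.parallelize(Seq(1, 2, 3, 4))

// Transformation: returns a new RDD; nothing is computed yet
val doubled = nums.map(_ * 2)

// Action: triggers the actual computation and returns a result to the driver
val result = doubled.collect() // Array(2, 4, 6, 8)
```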
Transformations with Examples
The role of a transformation in Spark is to create a new dataset from an existing one. Transformations are lazy: they are not executed when they are defined, but only when an action requires a result to be returned to the driver program. Two of the most commonly used transformations are map() and filter().

The resulting RDD is always distinct from its parent RDD. It can be smaller (e.g. filter(), distinct(), sample()), bigger (e.g. flatMap(), union(), cartesian()), or the same size (e.g. map()).
In this section, I will explain a few RDD transformations with a word count example in Scala. Before we start, let's create an RDD by reading a text file. The text file used here is a dummy dataset; you can use any dataset.
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

val spark: SparkSession = SparkSession.builder()
  .master("local")
  .appName("SparkByExamples.com")
  .getOrCreate()

val sc = spark.sparkContext
val rdd: RDD[String] = sc.textFile("src/main/scala/test.txt")
The flatMap() transformation applies a function to each element and flattens the results into a new RDD. The example below splits each record in the RDD by space and flattens the result, so each element of the resulting RDD is a single word.
val rdd2 = rdd.flatMap(f=>f.split(" "))
The map() transformation is used to apply any complex operation, such as adding or updating a field. The output of map() always has the same number of records as the input.
In our word count example, we pair each word with the value 1. The result is a pair RDD of key-value tuples, where the key is the word of type String and the value is 1 of type Int. I've stated the type of the rdd3 variable explicitly for clarity.
val rdd3:RDD[(String,Int)]= rdd2.map(m=>(m,1))
The records in an RDD can be filtered with the filter() transformation. In our example, we keep only the words that begin with "a".
val rdd4 = rdd3.filter(a=> a._1.startsWith("a"))
The reduceByKey() transformation merges the values for each key using the supplied function. In our example, it sums the values (the 1s) for each word, so the resulting RDD contains the count of each unique word.
val rdd5 = rdd3.reduceByKey(_ + _)
With the union() transformation, the new RDD contains the elements of both input RDDs. The two RDDs must be of the same type for this function to work.
For instance, if RDD1 contains the elements Spark, Spark, Hadoop, and Flink, and RDD2 contains Big data, Spark, and Flink, then rdd1.union(rdd2) will contain Spark, Spark, Spark, Hadoop, Flink, Flink, and Big data (duplicates are kept).
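The example above can be sketched as follows (constructing the two RDDs with the illustrative data; the application name is an assumption):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local")
  .appName("union-demo")
  .getOrCreate()
val sc = spark.sparkContext

val rdd1 = sc.parallelize(Seq("Spark", "Spark", "Hadoop", "Flink"))
val rdd2 = sc.parallelize(Seq("Big data", "Spark", "Flink"))

// union() keeps duplicates (unlike SQL's UNION, which deduplicates);
// the result has 4 + 3 = 7 elements
val combined = rdd1.union(rdd2)
```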
val rdd6 = rdd5.union(rdd3)
With the intersection() transformation, the new RDD contains only the elements common to both input RDDs. As with union(), the two RDDs must be of the same type.
val rdd7 = rdd5.intersection(rdd3)
In this Spark RDD Transformations blog, you have learned different transformation functions and their usage with Scala examples. In the next blog, we will learn about actions.
Happy Learning !!