What is RDD in Spark?
RDD stands for “Resilient Distributed Dataset”. In Apache Spark, an RDD is a data structure: an immutable collection of objects that is computed on the different nodes of the cluster.
- Resilient, i.e. fault-tolerant: if any node fails, a lost partition can be recovered, since Spark records how each RDD was built and can recompute it on another node.
- Distributed, since the data resides on multiple nodes.
- Dataset represents the records of the data. Users can load a dataset from external storage, which can be a JSON file, CSV file, text file, or a database via JDBC, with no specific data structure required.
The Need for RDD in Apache Spark
- MapReduce was not suitable for the distributed iterative computation required by complex machine learning algorithms.
- Traditional technologies were time-consuming.
- They created significant data redundancy through data replication and overhead through data serialization.
Features of Spark RDD
The features of Apache Spark RDDs are:
1. In-Memory Computation:-
In this type of computation, the data is kept in RAM rather than on disk, which increases the computation speed of the system and provides faster access to the data. In-memory computation is also well suited to real-time processing, such as fraud detection.
2. Lazy Evaluation:-
Spark records all the transformations we apply to an RDD but does not produce any output until an action is invoked.
3. Immutability:-
Spark RDDs are immutable because the access they provide is read-only; the only way to "modify" an RDD is to apply a transformation, which produces a new RDD.
4. Fault Tolerance:-
RDDs are fault-tolerant because any lost partition can be recomputed by replaying the transformations recorded in its lineage.
5. Persistence:-
Users can mark which RDDs they will reuse and choose a storage strategy (e.g. in memory or on disk) for them.
6. Partitioning:-
Spark RDDs achieve parallelism through partitioning. Spark determines the number of partitions the data is divided into, and a new partitioning can be created by applying transformations to existing partitions.
7. Location Stickiness:-
RDDs can define a placement preference for computing their partitions. A placement preference is information about the preferred location of a partition, which lets Spark schedule computation close to the data.
8. Coarse Grained Operations:-
Every operation applied to an RDD is coarse-grained: it applies to the dataset as a whole (e.g. a map or filter over all elements) rather than to individual elements.
Spark RDD Operations
There are two types of operations on an Apache Spark RDD: transformations and actions.
Transformations are lazy operations on an RDD in Apache Spark: each one creates one or more new RDDs, which are executed only when an action occurs. A transformation thus creates a new dataset from an existing one.
There are two kinds of transformations: narrow transformation, wide transformation.
A narrow transformation is the result of functions such as map and filter, where the data comes from a single partition only, i.e. each partition is self-sufficient: every partition of the output RDD contains records that originate from a single partition of the parent RDD.
A wide transformation is the result of functions like groupByKey() and reduceByKey(), because the data required to compute the records in a single partition may live in many partitions of the parent RDD.
An action in Spark returns the final result of RDD computations. It triggers execution using the lineage graph: the data is loaded into the original RDD, all intermediate transformations are carried out, and the final result is returned to the driver program or written to the file system. A lineage graph is the dependency graph of all the parent RDDs of an RDD.
Actions are RDD operations that produce non-RDD values; they materialize a value in a Spark program. An action is one of the ways to send results from the executors to the driver. first(), take(), reduce(), collect(), and count() are some of the actions in Spark.
Using transformations, one can create an RDD from an existing one; but when we want to work with the actual dataset, we use an action. Unlike a transformation, an action does not create a new RDD: it returns a non-RDD value, which it delivers either to the driver or to an external storage system. Actions set the laziness of RDDs into motion.
In conclusion, Spark RDDs overcame the shortcomings of Hadoop MapReduce by introducing in-memory processing, immutability, persistence, and more. But RDDs also have limitations, for example no built-in optimization, and storage and performance constraints.
Because of these limitations, and to make Spark more versatile, the concepts of DataFrame and Dataset evolved.