What is RDD in Spark?
RDD stands for Resilient Distributed Dataset. An RDD is the fundamental data structure of Apache Spark: an immutable collection of objects that is computed on different nodes of the cluster. Each dataset in a Spark RDD is logically partitioned across many servers, so that each partition can be computed on a different node of the cluster.
Decomposing the name RDD:
- Resilient, i.e. fault-tolerant: with the help of the RDD lineage graph (DAG), Spark can recompute partitions that are missing or damaged due to node failures.
- Distributed, since the data resides on multiple nodes.
- Dataset, the records of the data you work with. The user can load the dataset externally, from a JSON file, CSV file, text file, or a database via JDBC, with no specific structure required.
Spark makes use of the concept of RDD to achieve faster and more efficient MapReduce operations.
Features of Spark RDD
i. In-memory Computation
Spark RDDs support in-memory computation: intermediate results are stored in distributed memory (RAM) instead of stable storage (disk).
ii. Lazy Evaluation
Transformations in Apache Spark are lazy: they do not compute their results right away. Instead, each transformation only records the operation applied to a base dataset. Spark computes the transformations when an action requires a result for the driver program, which improves the performance of the program.
iii. Fault Tolerance
Spark RDDs are fault-tolerant: they track lineage information so that lost data can be rebuilt automatically on failure. Each RDD remembers how it was created from other datasets (by transformations such as map, join, or groupBy), so it can recreate any lost partition from its parents.
iv. Immutability
An important fact about an RDD is that it is immutable: you cannot change its state. If you want a modified version of an RDD, you create a new RDD from the existing one by applying transformations. Hence, the required RDD can be recreated at any time.
v. Partitioning
Data items in RDDs are usually huge. The data is partitioned and sent across different nodes for distributed computing.
vi. Persistence
Intermediate results generated by an RDD can be persisted and reused across computations, which optimizes the overall process.
vii. Coarse-grained Operations
Operations such as map, filter, and groupBy apply to all elements of a dataset at once, rather than to individual elements.
Spark RDD Operations
RDD in Apache Spark supports two types of operations: transformations and actions.
Transformations
Transformations are functions that accept existing RDDs as input and output one or more new RDDs. The data in the existing RDD does not change, since RDDs are immutable. Some of the transformation operations are provided in the table below:
| Transformation | Description |
| --- | --- |
| map() | Returns a new RDD by applying a function to each data element |
| filter() | Returns a new RDD formed by selecting those elements of the source on which the function returns true |
| reduceByKey() | Aggregates the values of each key using a function |
| groupByKey() | Converts a (key, value) pair into a (key, &lt;iterable of values&gt;) pair |
| union() | Returns a new RDD that contains all elements from the source RDD and the argument RDD |
| intersection() | Returns a new RDD that contains the elements common to both datasets |
Transformations are lazy: they are executed only when an action is invoked. Every time a transformation is applied, a new RDD is created.
Actions
Actions in Spark are functions that return the final result of RDD computations. An action triggers execution of the lineage graph: Spark loads the data, carries out all the required transformations, and returns the result to the Spark driver. Actions are operations that produce non-RDD values. Some of the common actions used in Spark are given below:
| Action | Description |
| --- | --- |
| count() | Gets the number of data elements in an RDD |
| collect() | Gets all the data elements in an RDD as an array |
| reduce() | Aggregates the elements of an RDD using a function that takes two arguments and returns one |
| take(n) | Fetches the first n elements of an RDD |
| foreach(operation) | Executes the operation on each data element in an RDD |
| first() | Retrieves the first data element of an RDD |
When to use RDDs?
RDDs are preferred when you want to apply low-level transformations and actions, since they give you greater control over your data. Use RDDs when the data is highly unstructured, such as media or text streams, when you want functional programming constructs rather than domain-specific expressions, or when no schema is applied to the data.
In this blog, we have covered the basics of Spark RDDs: their features, the operations they support, and when to use them.