Things to know about Spark RDD

Reading Time: 3 minutes

What is RDD in Spark?

RDD stands for Resilient Distributed Dataset. RDDs are the backbone of Apache Spark: an RDD is the fundamental data structure of Spark, an immutable collection of objects that is computed on the different nodes of the cluster. Every dataset in a Spark RDD is logically partitioned across many servers, so each partition can be computed on a different node of the cluster.

Decomposing the name RDD:

  • Resilient, i.e. fault-tolerant with the help of the RDD lineage graph (DAG), so it is able to recompute missing or damaged partitions caused by node failures.
  • Distributed, since the data resides on multiple nodes.
  • Dataset represents the records of data you work with; the user can load the dataset externally, from a JSON file, CSV file, text file, or a database via JDBC, with no specific data structure required.

Spark makes use of the concept of RDD to achieve faster and more efficient MapReduce operations.
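For example, here is a minimal sketch of creating an RDD in Scala, assuming a SparkContext named sc is already available (as in spark-shell); the file path is purely illustrative:

// Create an RDD from a local collection
val numbers = sc.parallelize(Seq(1, 2, 3, 4, 5))

// Or load an RDD from an external text file (the path is an assumption)
val lines = sc.textFile("hdfs:///data/sample.txt")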

Features of Spark RDD

i. In-memory Computation

Spark RDDs support in-memory computation: intermediate results are stored in distributed memory (RAM) instead of on stable storage (disk).
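A minimal sketch of keeping intermediate results in memory, again assuming a SparkContext named sc and illustrative data:

val squares = sc.parallelize(1 to 1000).map(x => x * x)
squares.cache()    // mark the RDD to be kept in memory once computed
squares.count()    // first action computes and caches the partitions
squares.take(5)    // served from memory rather than recomputed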

ii. Lazy Evaluation

Transformations in RDDs are lazy operations: Spark does not compute their results right away. Instead, it just remembers the transformations applied to the base dataset, and results are generated only when an action triggers the computation. This improves the performance of the program.

Spark computes transformations only when an action requires a result for the driver program. Follow this guide for a deeper study of Spark Lazy Evaluation.
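The following sketch illustrates laziness, assuming sc as before and a hypothetical input path:

// Nothing is computed here: Spark only records the transformations
val words = sc.textFile("hdfs:///data/words.txt")
val longWords = words.filter(_.length > 5)

// Computation is triggered only by the action below
val howMany = longWords.count()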

iii. Fault Tolerance

Spark RDDs are fault-tolerant because they track lineage information and can rebuild lost data automatically on failure. Each RDD remembers how it was created from other datasets (through transformations like map, join, or groupBy), so a lost partition can be recomputed from its lineage. Follow this guide for a deeper study of RDD Fault Tolerance.
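You can inspect the lineage Spark keeps for recovery; this sketch assumes sc and a small illustrative pipeline:

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
val totals = pairs.reduceByKey(_ + _)

// Prints the chain of parent RDDs used to recompute lost partitions
println(totals.toDebugString)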

iv. Immutability

The important fact about RDDs is that they are immutable: you cannot change the state of an RDD. If you want a different state, you create a new RDD from the existing one by applying the required operations. Hence, the original RDD remains available at any time.
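A small sketch showing that a transformation returns a new RDD and leaves the original untouched (assuming sc):

val original = sc.parallelize(Seq(1, 2, 3))
val doubled = original.map(_ * 2)           // a new RDD; original is unchanged

println(original.collect().mkString(", "))  // 1, 2, 3
println(doubled.collect().mkString(", "))   // 2, 4, 6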

v. Partitioning

Data items in RDDs are usually huge. This data is partitioned and sent across different nodes for distributed computing.
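A sketch of inspecting and changing the number of partitions (assuming sc; the partition counts are arbitrary):

val data = sc.parallelize(1 to 100000, 4)   // request 4 partitions
println(data.getNumPartitions)              // 4

val reshaped = data.repartition(8)          // shuffle into 8 partitions
println(reshaped.getNumPartitions)          // 8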

vi. Persistence

Intermediate results generated by an RDD can be persisted and reused, which avoids recomputation and makes the overall process more efficient.
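A sketch of persisting an RDD with an explicit storage level, assuming sc and a hypothetical input path:

import org.apache.spark.storage.StorageLevel

val logs = sc.textFile("hdfs:///data/logs.txt")
val errors = logs.filter(_.contains("ERROR"))

errors.persist(StorageLevel.MEMORY_AND_DISK)  // spill to disk if memory is full
errors.count()      // materialises and persists the RDD
errors.first()      // reuses the persisted data
errors.unpersist()  // release the storage when no longer needed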

vii. Coarse-grained Operations

Operations on RDDs are coarse-grained: a transformation such as map, filter, or groupBy applies to all elements of the dataset, rather than to individual records.
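A tiny sketch of a coarse-grained operation (assuming sc): the same function is applied to every element of the dataset at once.

val names = sc.parallelize(Seq("spark", "rdd", "dataset"))
val upper = names.map(_.toUpperCase)   // applied to all elements, not to individual records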

Spark RDD Operations

RDD in Apache Spark supports two types of operations:

  • Transformation
  • Actions

Transformations

These are functions that accept existing RDDs as input and output one or more new RDDs. However, the data in the existing RDD does not change, as it is immutable. Some of the transformation operations are listed below:

  • map(): Returns a new RDD by applying the function to each data element
  • filter(): Returns a new RDD formed by selecting those elements of the source on which the function returns true
  • reduceByKey(): Aggregates the values of a key using a function
  • groupByKey(): Converts a (key, value) pair into a (key, <iterable value>) pair
  • union(): Returns a new RDD that contains all elements from the source RDD and its argument
  • intersection(): Returns a new RDD that contains the intersection of the elements of the two datasets

These transformations are executed only when an action is called on the resulting RDD. Every time a transformation is applied, a new RDD is created.
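A minimal sketch of a few of these transformations in Scala (assuming sc and illustrative data):

val nums = sc.parallelize(1 to 10)
val even = nums.filter(_ % 2 == 0)     // transformation: no work happens yet
val doubled = even.map(_ * 2)          // transformation: a new RDD each time

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
val summed = pairs.reduceByKey(_ + _)  // becomes (a, 4), (b, 2) once an action runs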

Actions

Actions in Spark are functions that return the end result of RDD computations. An action uses the lineage graph to load the data, apply all the transformations in order, and return the final result to the Spark driver. Actions are operations that produce non-RDD values. Some of the common actions used in Spark are given below:

  • count(): Gets the number of data elements in an RDD
  • collect(): Gets all the data elements of an RDD as an array
  • reduce(): Aggregates the data elements of an RDD using a function that takes two arguments and returns one
  • take(n): Fetches the first n elements of an RDD
  • foreach(operation): Executes the operation for each data element in an RDD
  • first(): Retrieves the first data element of an RDD
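A short sketch of these actions in use (assuming sc and illustrative data):

val nums = sc.parallelize(Seq(3, 1, 4, 1, 5))

println(nums.count())                  // 5
println(nums.first())                  // 3
println(nums.take(2).mkString(", "))   // 3, 1
println(nums.reduce(_ + _))            // 14
nums.collect().foreach(println)        // brings all elements to the driver and prints them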

When to use RDDs?

RDDs are preferred when you want to apply low-level transformations and actions, as they give you greater control over your data. Use RDDs when the data is highly unstructured, such as media or text streams, when you want functional programming constructs rather than domain-specific expressions, or when no schema is applied to the data.

Conclusion

In this blog, we have learned the basics of Spark RDDs: what an RDD is, the features of Spark RDDs, Spark RDD operations, and when to use RDDs.