Spark: RDD vs DataFrames


Spark SQL is a Spark module for structured data processing. Unlike the basic Spark RDD API, the interfaces provided by Spark SQL provide Spark with more information about the structure of both the data and the computation being performed. Internally, Spark SQL uses this extra information to perform extra optimizations.
One use of Spark SQL is to execute SQL queries. When running SQL from within another programming language, the results are returned as a Dataset/DataFrame.
Before exploring these APIs, let's understand why they are needed.

 

RDDs:

An RDD is a distributed collection of elements. All work in Spark is expressed as either creating new RDDs, transforming existing RDDs, or calling actions on RDDs to compute a result. Spark automatically distributes the data contained in RDDs across your cluster and parallelizes the operations you perform on them.

Problems with RDDs:

  • They express the how of a solution rather than the what, i.e., the RDD API is a bit opaque.

[Image: RDD code sample showing a reduceByKey transformation]

Looking at the example above, we have to think about how the reduceByKey transformation is performed rather than what result we want.
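
The original post shows the code as a screenshot, so here is a minimal sketch of the kind of RDD code the example refers to. The word-count logic, the application name, and the input path are assumptions for illustration, not the exact code from the screenshot.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("rdd-vs-dataframe")   // hypothetical app name
  .master("local[*]")
  .getOrCreate()
val sc = spark.sparkContext

// A classic RDD word count: every step spells out *how* the result is computed.
val lines = sc.textFile("data/input.txt")   // hypothetical input path
val counts = lines
  .flatMap(_.split("\\s+"))                 // split each line into words
  .map(word => (word, 1))                   // pair each word with a count of 1
  .reduceByKey(_ + _)                       // shuffle and sum the counts per word

counts.take(10).foreach(println)
```

Nothing in this chain tells Spark what result we ultimately want; it only sees opaque functions to run, which is the opacity described above.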

  • They cannot be optimized by Spark.
  • It’s too easy to build an inefficient RDD transformation chain.

[Image: RDD code sample chaining two separate filter transformations]

Here, we can see that the two filter operations could have been combined into a single transformation using the AND operator, but Spark does not perform this optimization for us.
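
A sketch of such an RDD chain with two separate filters is shown below; the employee tuples and the predicates are made up purely for illustration.

```scala
// Two separate filter transformations chained on an RDD.
// Spark runs both predicate functions exactly as written; it does not merge
// them into one combined predicate on our behalf.
val employees = sc.parallelize(Seq(
  ("Alice", 34, "Engineering"),
  ("Bob",   45, "Sales"),
  ("Cara",  29, "Engineering")
))

val filtered = employees
  .filter { case (_, age, _)  => age > 30 }
  .filter { case (_, _, dept) => dept == "Engineering" }

// The same result could be produced with a single filter:
// employees.filter { case (_, age, dept) => age > 30 && dept == "Engineering" }
filtered.collect().foreach(println)
```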

[Image: Spark UI showing the unoptimized RDD transformation chain]

In the Spark UI above, we can see that Spark did not optimize the transformation chain. So we conclude that the RDD API does not take care of query optimization; this is handled by the DataFrame API.

DataFrames:

A Spark DataFrame is a distributed collection of data organized into named columns. It provides operations to filter, group, or compute aggregates, and can be used with Spark SQL. DataFrames can be constructed from structured data files, existing RDDs, tables in Hive, or external databases.

Characteristics of DataFrames:

  • DataFrame API provides a higher-level abstraction, allowing you to use a query language to manipulate data.
  • Provide SQL functionality.
  • Focus on What rather than How.

[Image: DataFrame code sample expressing the same query with two filters]

Here, query optimization is handled by Spark itself.
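
For comparison, here is a sketch of the same two-filter query written with the DataFrame API; the column names and the data are the same illustrative values as in the RDD sketch above.

```scala
import org.apache.spark.sql.functions.col

// The same query on a DataFrame: we declare *what* we want,
// and the Catalyst optimizer decides *how* to execute it.
val employeesDF = spark.createDataFrame(Seq(
  ("Alice", 34, "Engineering"),
  ("Bob",   45, "Sales"),
  ("Cara",  29, "Engineering")
)).toDF("name", "age", "dept")

val resultDF = employeesDF
  .filter(col("age") > 30)
  .filter(col("dept") === "Engineering")

resultDF.show()
```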

[Image: Output of explain(true) showing the parsed, analyzed, and optimized logical plans and the physical plan]

As we can see, there are three types of logical plan (parsed, analyzed, and optimized) and one physical plan.
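
All of these plans can be printed by calling explain(true) on a DataFrame; the exact output text varies between Spark versions. Continuing with the hypothetical resultDF from the sketch above:

```scala
// Prints the parsed, analyzed, and optimized logical plans plus the physical plan.
// For the two chained filters above, the optimized logical plan collapses them
// into a single Filter whose condition combines both predicates with AND.
resultDF.explain(true)
```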

The analyzed logical plan goes through a series of resolution rules, after which the optimized logical plan is produced. At this stage Spark applies a set of optimization rules, and you can also plug in your own rules.
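
As a rough sketch of what plugging in your own rule looks like, Spark exposes an experimental hook for extra optimizer rules. The no-op rule below is made up purely to show where a custom rule is registered.

```scala
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// A do-nothing optimizer rule; a real rule would pattern-match on the plan
// and return a rewritten version of it.
object NoOpRule extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan
}

// Register the rule so Catalyst applies it while producing the optimized logical plan.
spark.experimental.extraOptimizations = Seq(NoOpRule)
```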

This optimized logical plan is then converted into a physical plan for execution. All of this planning happens inside the DataFrame API.

In the optimized logical plan, Spark performs the optimization itself: it sees that there is no need for two separate filters, because the same work can be done with a single filter using the AND operator, so it executes just one filter.

The physical plan is the actual RDD chain that will be executed by Spark.

Conclusion:

RDDs have useful characteristics like

  • Immutability
  • Lazy evaluation

But they lack query optimization and make us focus more on the how of a solution than the what. We have seen how DataFrames overcome these shortcomings of RDDs.

References:

  1. https://dzone.com/articles/understanding-optimized-logical-plan-in-spark
  2. https://spark.apache.org/docs/latest/


Written by Ayush

Ayush is a Software Consultant with more than 11 months of experience. He has knowledge of various programming languages such as C, C++, Java, Scala, and JavaScript, and is currently working on big data technologies like Spark, Kafka, and Elasticsearch. He is always eager to learn new and advanced concepts in order to expand his horizons and apply them in project development with his existing knowledge. His hobbies include playing cricket, travelling, and watching movies.