Spark SQL is a Spark module for structured data processing. Unlike the basic Spark RDD API, the interfaces provided by Spark SQL provide Spark with more information about the structure of both the data and the computation being performed. Internally, Spark SQL uses this extra information to perform extra optimizations.
One use of Spark SQL is to execute SQL queries. When running SQL from within another programming language, the results are returned as a Dataset/DataFrame.
Before exploring these APIs, let’s understand why they are needed.
An RDD is a distributed collection of elements. All work in Spark is expressed as either creating new RDDs, transforming existing RDDs, or calling actions on RDDs to compute a result. Spark automatically distributes the data contained in RDDs across your cluster and parallelizes the operations you perform on them.
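The create/transform/act cycle described above can be sketched in plain Python. This is only a toy, single-machine stand-in and not Spark's API: the class name `LazyCollection` and its methods are hypothetical, and nothing here is distributed or parallelized. It only shows that transformations are recorded and that work happens when an action is called.

```python
class LazyCollection:
    """Toy stand-in for an RDD: transformations are recorded, actions run them."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = steps

    def map(self, func):
        # Transformation: returns a new collection, computes nothing yet.
        return LazyCollection(self._source, self._steps + (("map", func),))

    def filter(self, predicate):
        # Transformation: also only recorded.
        return LazyCollection(self._source, self._steps + (("filter", predicate),))

    def collect(self):
        # Action: replay the recorded steps and materialize the result.
        items = iter(self._source)
        for kind, func in self._steps:
            items = map(func, items) if kind == "map" else filter(func, items)
        return list(items)


numbers = LazyCollection(range(1, 6))        # create
squares = numbers.map(lambda n: n * n)       # transform
large = squares.filter(lambda n: n > 5)      # transform
result = large.collect()                     # action triggers the computation
print(result)  # [9, 16, 25]
```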
Problems with RDDs:
- They express the how of a solution better than the what; that is, code written against the RDD API is somewhat opaque.
With a solution built on reduceByKey, for instance, reading the code makes us think about how the reduceByKey transformation is performed rather than about what result is being computed.
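As a rough illustration of this "how"-oriented style, here is a minimal sketch in plain Python (not Spark code) of the steps a reduceByKey-style word count spells out. The helper name `reduce_by_key` and the sample words are assumptions made for this example; Spark's own reduceByKey runs across partitions on a cluster, which this sketch does not model.

```python
def reduce_by_key(pairs, func):
    """Merge the values of each key with func, one pair at a time."""
    result = {}
    for key, value in pairs:
        if key in result:
            result[key] = func(result[key], value)
        else:
            result[key] = value
    return result


words = ["spark", "rdd", "spark", "sql", "rdd", "spark"]

# The "how": first build (word, 1) pairs, then merge the counts per key.
pairs = [(word, 1) for word in words]
counts = reduce_by_key(pairs, lambda a, b: a + b)
print(counts)  # {'spark': 3, 'rdd': 2, 'sql': 1}
```

The reader has to reconstruct the intent (count the occurrences of each word) from the mechanics of pairing and merging.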
- They cannot be optimized by Spark.
- It’s too easy to build an inefficient RDD transformation chain.
Here, the two filter operations could have been applied in a single transformation by combining their conditions with the AND operator, but Spark does not perform this optimization for RDDs.
Spark did not optimize the transformation chain, so we conclude that the RDD API does not take care of query optimization. That is handled by the DataFrame API.
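The shape of the problem can be sketched in plain Python, with the built-in `filter` standing in for an RDD filter transformation (an assumption for illustration; this is not Spark code, and the data is made up). Like the RDD API, it runs the chain exactly as written and makes no attempt to merge the two predicates.

```python
def chained(numbers):
    """Two separate filter stages, applied exactly as written."""
    positives = filter(lambda n: n > 0, numbers)
    return filter(lambda n: n % 2 == 0, positives)


def combined(numbers):
    """The same logic as one stage, with the predicates joined by AND."""
    return filter(lambda n: n > 0 and n % 2 == 0, numbers)


data = [-4, -1, 0, 2, 3, 6, 7, 10]
chained_result = list(chained(data))
combined_result = list(combined(data))

print(chained_result)   # [2, 6, 10]
print(combined_result)  # [2, 6, 10]
```

Both versions return the same rows, but getting the single-stage form is left entirely to the programmer.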
A Spark DataFrame is a distributed collection of data organized into named columns. It provides operations to filter, group, or compute aggregates, and can be used with Spark SQL. DataFrames can be constructed from structured data files, existing RDDs, tables in Hive, or external databases.
Characteristics of DataFrames:
- The DataFrame API provides a higher-level abstraction, allowing you to use a query language to manipulate data.
- SQL functionality is available.
- The focus is on the what rather than the how.
Here, query optimization is handled by Spark.
As we can see, there are three types of logical plan (parsed, analyzed, and optimized) and one physical plan.
To produce the analyzed logical plan, Spark applies a series of rules that resolve the references in the parsed plan. The optimized logical plan is then produced by applying a set of optimization rules to the analyzed plan. You can also plug in your own rules at this optimization stage.
This optimized logical plan is converted to a physical plan for further execution. These plans lie inside the DataFrame API.
In the optimized logical plan, Spark performs the optimization itself. It sees that there is no need for two filters: the same work can be done with a single filter using the AND operator, so it executes them as one filter.
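The kind of rewrite described here can be sketched as a toy rule over a tiny plan tree. This is plain Python and not Spark's optimizer: the `Scan` and `Filter` classes, the `combine_filters` function, and the table and column names are all hypothetical, chosen only to show two stacked filters collapsing into one.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Scan:
    table: str


@dataclass(frozen=True)
class Filter:
    condition: str
    child: "Plan"


Plan = Union[Scan, Filter]


def combine_filters(plan):
    """Rewrite a Filter directly above another Filter into one AND-ed Filter."""
    if isinstance(plan, Scan):
        return plan
    child = combine_filters(plan.child)
    if isinstance(child, Filter):
        # Inner condition first, then the outer one, joined by AND.
        merged = f"({child.condition}) AND ({plan.condition})"
        return Filter(merged, child.child)
    return Filter(plan.condition, child)


# Plan as written by the user: two stacked filters over a made-up table.
analyzed = Filter("age < 60", Filter("age > 18", Scan("people")))
optimized = combine_filters(analyzed)
print(optimized)
# Filter(condition='(age > 18) AND (age < 60)', child=Scan(table='people'))
```

A real optimizer applies many such rules repeatedly; the sketch shows a single rule applied once over the tree.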
The physical plan is the actual RDD chain that Spark will execute.
RDDs had useful characteristics such as:
- Lazy evaluation, etc
But they lacked query optimization and focused more on the how than on the what of a solution. We have seen how DataFrames overcome these shortcomings of RDDs.