In my previous blog, Spark: RDD vs DataFrames, I discussed the shortcomings of RDDs and how DataFrames overcome them. Now we’ll look at the shortcomings of DataFrames and how the Dataset API overcomes them.
A DataFrame is a distributed collection of data organized into named columns. Conceptually, it is equivalent to a table in a relational database, backed by good optimization techniques. A DataFrame can be constructed from a wide array of sources such as Hive tables, structured data files, external databases, or existing RDDs. Here is an example of how we can construct a DataFrame from an RDD.
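A minimal sketch of that construction, assuming Spark 2.x or later (where `SparkSession` is the entry point) and a local master. The `Person` case class and the sample values are hypothetical and exist only for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical record type, used only for this illustration.
case class Person(name: String, age: Int)

object DataFrameFromRdd {
  def build(spark: SparkSession): DataFrame = {
    import spark.implicits._ // brings toDF() into scope for RDDs of case classes

    val rdd = spark.sparkContext.parallelize(
      Seq(Person("Alice", 29), Person("Bob", 35)))

    // Column names and types are derived from the case class fields.
    rdd.toDF()
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("dataframe-from-rdd")
      .master("local[*]")
      .getOrCreate()

    val df = build(spark)
    df.printSchema()
    df.show()

    spark.stop()
  }
}
```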
Limitations of DataFrames:-
- We lose type-safety.
- Along with type-safety, we also lose the ability to use powerful features such as lambda expressions.
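To make both limitations concrete, here is a small sketch, again assuming Spark 2.x or later and a hypothetical `Person` case class. The filter is written as a column expression rather than a typed lambda, the collected result is an array of `Row`, and a misspelled column name is only caught when the query runs.

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import scala.util.Try

// Hypothetical record type, used only for this illustration.
case class Person(name: String, age: Int)

object UntypedRows {
  // Filtering uses a Column expression; the result is a collection of Row.
  def olderThan(df: DataFrame, age: Int): Array[Row] =
    df.filter(df("age") > age).collect()

  // A misspelled column name still compiles; the mistake only
  // surfaces as an exception when Spark analyses and runs the query.
  def misspelledColumnFails(df: DataFrame): Boolean =
    Try(df.select("agee").collect()).isFailure

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("untyped-rows")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(Person("Alice", 29), Person("Bob", 35)).toDF()

    val rows: Array[Row] = olderThan(df, 30)
    rows.foreach(println)
    println(s"Misspelled column fails at runtime: ${misspelledColumnFails(df)}")

    spark.stop()
  }
}
```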
As we can see, the result is of type Row. Row is untyped, much like Any in Scala: the compiler knows nothing about the fields it holds. So to get back to type-safe code, we would have to extract and cast every field by hand.
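A sketch of what that manual conversion might look like, assuming the same hypothetical `Person` case class with `name` in the first column and `age` in the second:

```scala
import org.apache.spark.sql.{Row, SparkSession}

// Hypothetical record type, used only for this illustration.
case class Person(name: String, age: Int)

object RowToTyped {
  // Positions and types are repeated by hand. A wrong index or a wrong
  // getter compiles fine and fails only at runtime.
  def rowToPerson(row: Row): Person =
    Person(row.getString(0), row.getInt(1))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("row-to-typed")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(Person("Alice", 29), Person("Bob", 35)).toDF()

    // Every Row has to be unpacked field by field.
    val people: Array[Person] = df.collect().map(rowToPerson)
    people.foreach(println)

    spark.stop()
  }
}
```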
As the above snippet shows, we surely do not want to do this: it is highly error-prone and leads to boilerplate code. This is where Datasets come to the rescue.
The Dataset API provides the benefits of RDDs (strong typing, ability to use powerful lambda functions) together with the benefits of Spark SQL’s optimized execution engine. You can define Dataset objects and then manipulate them using functional transformations (map, flatMap, filter, and so on), much as you would an RDD. The benefit is that, unlike with RDDs, these transformations are applied to a structured and strongly typed distributed collection, which allows Spark to leverage Spark SQL’s execution engine for optimization.
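As a rough illustration of typed transformations, the sketch below filters and maps a `Dataset[Person]` with plain Scala lambdas. It assumes Spark 2.x or later; `Person` and the helper `namesOlderThan` are hypothetical names chosen for this example.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical record type, used only for this illustration.
case class Person(name: String, age: Int)

object TypedTransformations {
  // Both lambdas are checked by the compiler: p is a Person,
  // so a typo such as p.agee would not compile.
  def namesOlderThan(spark: SparkSession,
                     people: Dataset[Person],
                     age: Int): Dataset[String] = {
    import spark.implicits._ // provides the Encoder[String] needed by map
    people.filter(p => p.age > age).map(p => p.name)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("typed-transformations")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val people = Seq(Person("Alice", 29), Person("Bob", 35)).toDS()
    namesOlderThan(spark, people, 30).show()

    spark.stop()
  }
}
```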
- Creating a Dataset from an RDD
To convert an RDD into a Dataset, call rdd.toDS().
- Creating a Dataset from a DataFrame
To convert a DataFrame into a Dataset, call df.as[SomeCaseClass].
- Creating a Dataset from a Sequence
To convert a sequence to a Dataset, call .toDS() on the sequence.
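The three creation paths above can be sketched side by side. This assumes Spark 2.x or later with `spark.implicits._` in scope; `Person` stands in for `SomeCaseClass` and, like the sample data, is hypothetical.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical record type, standing in for SomeCaseClass.
case class Person(name: String, age: Int)

object CreatingDatasets {
  private val sample = Seq(Person("Alice", 29), Person("Bob", 35))

  // From an RDD: rdd.toDS()
  def fromRdd(spark: SparkSession): Dataset[Person] = {
    import spark.implicits._
    spark.sparkContext.parallelize(sample).toDS()
  }

  // From a DataFrame: df.as[SomeCaseClass]
  // Column names and types must line up with the case class fields.
  def fromDataFrame(spark: SparkSession): Dataset[Person] = {
    import spark.implicits._
    val df = Seq(("Alice", 29), ("Bob", 35)).toDF("name", "age")
    df.as[Person]
  }

  // From a local sequence: seq.toDS()
  def fromSeq(spark: SparkSession): Dataset[Person] = {
    import spark.implicits._
    sample.toDS()
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("creating-datasets")
      .master("local[*]")
      .getOrCreate()

    fromRdd(spark).show()
    fromDataFrame(spark).show()
    fromSeq(spark).show()

    spark.stop()
  }
}
```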
Working with Datasets:-
Now that we have a fair idea of what Datasets are and how they are created, let’s understand them with the help of an example. We’ll write a simple word count program using Datasets.
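One possible version of such a word count is sketched below, assuming Spark 2.x or later. The input here is an in-memory sequence of lines rather than a file, and lower-casing and splitting on whitespace are choices made for this example, not requirements of the API.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

object DatasetWordCount {
  def wordCount(spark: SparkSession, lines: Seq[String]): Map[String, Long] = {
    import spark.implicits._

    val ds: Dataset[String] = lines.toDS()

    ds.flatMap(line => line.toLowerCase.split("\\s+")) // one element per word
      .filter(word => word.nonEmpty)                   // drop empty tokens
      .groupByKey(identity)                            // group by the word itself
      .count()                                         // Dataset[(String, Long)]
      .collect()
      .toMap
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("dataset-word-count")
      .master("local[*]")
      .getOrCreate()

    val counts = wordCount(spark, Seq("to be or not to be"))
    counts.foreach { case (word, n) => println(s"$word: $n") }

    spark.stop()
  }
}
```

Every step works on typed values (`String`, then `(String, Long)`), so no `Row` unpacking is needed at any point.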
Interoperating with DataFrames and RDDs:-
You can always go back to DataFrames to leverage their API by calling .toDF on Datasets.
Similarly, you can call .rdd on a Dataset to use RDD functionality.
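Both directions can be sketched briefly, assuming Spark 2.x or later. `Person` and the two helper functions are hypothetical and only serve to show `.toDF` and `.rdd` in use.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

// Hypothetical record type, used only for this illustration.
case class Person(name: String, age: Int)

object Interop {
  // Dataset -> DataFrame: use the untyped, column-based API.
  def countByAge(people: Dataset[Person]): DataFrame =
    people.toDF().groupBy("age").count()

  // Dataset -> RDD: use plain RDD operations on the typed objects.
  def totalAge(people: Dataset[Person]): Int = {
    val rdd: RDD[Person] = people.rdd
    rdd.map(_.age).reduce(_ + _)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("dataset-interop")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val people = Seq(Person("Alice", 29), Person("Bob", 35)).toDS()
    countByAge(people).show()
    println(s"Total age: ${totalAge(people)}")

    spark.stop()
  }
}
```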
The Dataset API is essentially a combination of RDDs and DataFrames. The DataFrame API was introduced for query optimization but lacked type-safety. The Dataset API is an enhancement to DataFrames that brings back the type-safety RDDs offered.