Spark: Introduction to Datasets

In my previous blog, Spark: RDD vs DataFrames, I discussed the shortcomings of RDDs and how DataFrames overcome them. Here we will look at the shortcomings of DataFrames and how the Dataset API overcomes them.

DataFrames:-

A DataFrame is a distributed collection of data organized into named columns. Conceptually, it is equivalent to a table in a relational database, but with richer optimizations under the hood. A DataFrame can be constructed from a wide array of sources, such as Hive tables, structured data files, external databases, or existing RDDs. Here is an example of how we can construct a DataFrame from an RDD.

[Image: df]
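A minimal sketch of such a conversion, assuming Spark 2.0 or later (the SparkSession entry point) and a local master. The Person case class, the sample values, and the object and method names are hypothetical and chosen only for illustration:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object RddToDataFrameSketch {
  // Hypothetical record type; not taken from the original post.
  case class Person(name: String, age: Int)

  // Builds a small RDD and converts it to a DataFrame.
  def buildDataFrame(spark: SparkSession): DataFrame = {
    import spark.implicits._ // brings toDF() into scope for RDDs of case classes
    val rdd = spark.sparkContext.parallelize(Seq(Person("Alice", 29), Person("Bob", 35)))
    rdd.toDF() // column names are taken from the case class fields
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("rdd-to-dataframe-sketch")
      .master("local[*]") // assumption: running locally
      .getOrCreate()

    val df = buildDataFrame(spark)
    df.printSchema()
    df.show()

    spark.stop()
  }
}
```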

Limitations of DataFrames:-

  1. We lose type safety.
  2. Along with type safety, we also lose the ability to use powerful features such as lambda expressions.

[Image: dfprob]
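To make the problem concrete, here is a hedged sketch (again assuming Spark 2.0 or later and a hypothetical Person case class). Whatever the DataFrame was built from, what comes back out is a Row, and each field has to be fetched by name or position with its type asserted by hand:

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}

object UntypedResultSketch {
  // Hypothetical record type used only for illustration.
  case class Person(name: String, age: Int)

  // Returns the first row with age > 30. The static type is Row, not Person.
  def firstOlderThan30(spark: SparkSession): Row = {
    import spark.implicits._
    val df: DataFrame = Seq(Person("Alice", 29), Person("Bob", 35)).toDF()
    df.filter($"age" > 30).head()
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("untyped-result-sketch")
      .master("local[*]") // assumption: running locally
      .getOrCreate()

    val row = firstOlderThan30(spark)
    // The compiler cannot check these accesses; a wrong name or type fails at runtime.
    val name = row.getAs[String]("name")
    val age = row.getInt(1)
    println(s"$name is $age")

    spark.stop()
  }
}
```

A misspelled column name in an expression such as `$"agee"` also compiles fine and is only rejected when Spark analyzes the query at runtime.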

As we can see, the result is of type Row. A Row is untyped, much like Any in Scala: the compiler knows nothing about its fields. To get back to type-safe code, we might need to do something like this:

[Image: dfpr]
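One way such a manual conversion might look, as a sketch under the same assumptions (Spark 2.0 or later, hypothetical Person case class). It drops to the underlying RDD of Rows and pattern matches on each Row:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row, SparkSession}

object RowToCaseClassSketch {
  // Hypothetical record type used only for illustration.
  case class Person(name: String, age: Int)

  // Rebuilds typed objects from untyped Rows by pattern matching.
  // A Row with a different shape or field types throws a MatchError at runtime.
  def toPeople(df: DataFrame): RDD[Person] =
    df.rdd.map {
      case Row(name: String, age: Int) => Person(name, age)
    }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("row-to-case-class-sketch")
      .master("local[*]") // assumption: running locally
      .getOrCreate()
    import spark.implicits._

    val df = Seq(("Alice", 29), ("Bob", 35)).toDF("name", "age")
    toPeople(df).collect().foreach(println)

    spark.stop()
  }
}
```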

We surely do not want to write code like the above snippet: it is highly error-prone and leads to boilerplate. This is where Datasets come to the rescue.

Datasets:-

The Dataset API provides the benefits of RDDs (strong typing and the ability to use powerful lambda functions) together with the benefits of Spark SQL’s optimized execution engine. You can define Dataset objects and then manipulate them using functional transformations (map, flatMap, filter, and so on), similar to an RDD. The benefit is that, unlike with RDDs, these transformations are applied to a structured, strongly typed distributed collection, which allows Spark to leverage Spark SQL’s execution engine for optimization.

Creating Datasets:-

  1. Creating a Dataset from an RDD
    To convert an RDD into a Dataset, call rdd.toDS().
    [Image: dsrdd]

  2. Creating a Dataset from a DataFrame
    To convert a DataFrame into a Dataset, call df.as[SomeCaseClass].
    [Image: dsfromdf]

  3. Creating a Dataset from a Sequence
    To convert a sequence to a Dataset, call .toDS() on the sequence.
    [Image: seqds]
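The three approaches above can be sketched as follows. This assumes Spark 2.0 or later; the Person case class, the sample data, and the method names are hypothetical:

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

object CreateDatasetsSketch {
  // Hypothetical record type; its field names must match the DataFrame columns for as[Person].
  case class Person(name: String, age: Int)

  // 1. From an RDD: toDS() comes from spark.implicits._
  def fromRdd(spark: SparkSession): Dataset[Person] = {
    import spark.implicits._
    val rdd = spark.sparkContext.parallelize(Seq(Person("Alice", 29), Person("Bob", 35)))
    rdd.toDS()
  }

  // 2. From a DataFrame: as[Person] attaches the type to the existing columns.
  def fromDataFrame(spark: SparkSession): Dataset[Person] = {
    import spark.implicits._
    val df = Seq(("Alice", 29), ("Bob", 35)).toDF("name", "age")
    df.as[Person]
  }

  // 3. From a local sequence.
  def fromSeq(spark: SparkSession): Dataset[Person] = {
    import spark.implicits._
    Seq(Person("Alice", 29), Person("Bob", 35)).toDS()
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("create-datasets-sketch")
      .master("local[*]") // assumption: running locally
      .getOrCreate()

    fromRdd(spark).show()
    fromDataFrame(spark).show()
    // Fields are accessed with compile-time checking, unlike with Row.
    fromSeq(spark).filter(_.age > 30).show()

    spark.stop()
  }
}
```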

Working with Datasets:-

Now that we have a fair idea of what Datasets are and how they are created, let’s understand them with the help of an example. We’ll write a simple word count program using Datasets.

[Image: workingds.png]
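A word count along these lines might look like the sketch below. It assumes Spark 2.0 or later and reads its input from an in-memory sequence rather than a file; the method name and the sample lines are hypothetical, and lower-casing the words is a choice made here, not something the original states:

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

object DatasetWordCountSketch {
  // Counts case-insensitive word occurrences using typed Dataset transformations.
  def wordCount(spark: SparkSession, lines: Seq[String]): Dataset[(String, Long)] = {
    import spark.implicits._
    lines.toDS()
      .flatMap(line => line.split("\\s+").toSeq) // split each line into words
      .filter(word => word.nonEmpty)             // drop empty tokens
      .groupByKey(word => word.toLowerCase)      // typed grouping by the word itself
      .count()                                   // yields (word, count) pairs
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("dataset-word-count-sketch")
      .master("local[*]") // assumption: running locally
      .getOrCreate()

    wordCount(spark, Seq("to be or not", "to be")).show()

    spark.stop()
  }
}
```

Every step takes an ordinary Scala lambda over String, so a mistake such as calling a method that does not exist on String is caught by the compiler.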

Interoperating with DataFrames and RDDs:-

You can always go back to DataFrames to leverage their API by calling .toDF on Datasets.

[Image: interDF.png]
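As a small illustration, assuming Spark 2.0 or later and a hypothetical Person case class, a Dataset can be turned into a DataFrame and then queried with column expressions:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object DatasetToDataFrameSketch {
  // Hypothetical record type used only for illustration.
  case class Person(name: String, age: Int)

  // Converts a Dataset to a DataFrame and uses the untyped column API on it.
  def namesOlderThan30(spark: SparkSession): DataFrame = {
    import spark.implicits._
    val ds = Seq(Person("Alice", 29), Person("Bob", 35)).toDS()
    ds.toDF().where($"age" > 30).select("name")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("dataset-to-dataframe-sketch")
      .master("local[*]") // assumption: running locally
      .getOrCreate()

    namesOlderThan30(spark).show()

    spark.stop()
  }
}
```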

Similarly, one can call .rdd on a Dataset to use RDD functionality.

[Image: interrdd]
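A brief sketch of this, under the same assumptions (Spark 2.0 or later, hypothetical Person case class). The resulting RDD keeps the element type of the Dataset:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object DatasetToRddSketch {
  // Hypothetical record type used only for illustration.
  case class Person(name: String, age: Int)

  // Drops from a Dataset[Person] to an RDD[Person] and uses plain RDD operations.
  def totalAge(spark: SparkSession): Int = {
    import spark.implicits._
    val ds = Seq(Person("Alice", 29), Person("Bob", 35)).toDS()
    val rdd: RDD[Person] = ds.rdd
    rdd.map(_.age).reduce(_ + _)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("dataset-to-rdd-sketch")
      .master("local[*]") // assumption: running locally
      .getOrCreate()

    println(totalAge(spark))

    spark.stop()
  }
}
```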

Conclusion:-

The Dataset API is essentially a combination of RDDs and DataFrames. The DataFrame API was introduced for query optimization but lacked type safety. The Dataset API is an enhancement to DataFrames that brings back the type safety RDDs had.


 


Written by 

Ayush is a Software Consultant with more than 11 months of experience. He has knowledge of various programming languages such as C, C++, Java, Scala, and JavaScript, and is currently working on Big Data technologies such as Spark, Kafka, and ElasticSearch. He is always eager to learn new and advanced concepts to expand his horizons and apply them in project development alongside his existing knowledge. His hobbies include playing cricket, travelling, and watching movies.
