Difference between RDD, DF and DS in Spark


In this blog I try to cover the differences between RDD, DF and DS. Many of you may be a little confused about RDD, DF and DS, so don't worry: after this blog everything will be clear.

With the Spark 2.0 release, there are three types of data abstraction which Spark officially provides to use: RDD, DataFrame and DataSet.

So let's start some discussion about them.

Resilient Distributed Datasets (RDDs) – An RDD is a fault-tolerant collection of elements that can be operated on in parallel.
With an RDD, we can perform operations on data across the different nodes of the same cluster in parallel, which helps to increase performance.

How can we create an RDD?

The Spark context (sc) helps to create an RDD in Spark. It can create an RDD from –

  1. an external storage system like HDFS, HBase, or any data source offering a Hadoop InputFormat.
  2. parallelizing an existing collection in your driver program.

Let's see examples of creating an RDD in both ways –

Creating an RDD by parallelizing an existing collection –

val data = Array(1, 2, 3, 4, 5)
val distData = sc.parallelize(data)
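
And here is a minimal sketch of the other way, creating an RDD from an external storage system. The HDFS path below is just a hypothetical placeholder –

// Read a text file from HDFS (or any Hadoop-supported storage) as an RDD of lines.
// The path is only an illustrative placeholder; replace it with a real file location.
val distFile = sc.textFile("hdfs://namenode:9000/user/data/sample.txt")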

In the first example we used the parallelize method of sc to create the RDD; in the second, textFile reads the file as an RDD of lines. Now we can perform operations on these RDDs. Basically, there are two types of operations defined –

  1. Transformations – A transformation takes an RDD, performs some operation on it and returns a new RDD. For example –
      val nrdd = distData.map(_ + 1)

    The map function applies the given function to every element of the RDD and returns a new RDD. There are a lot of transformation functions defined; you can read about them by clicking HERE.

  2. Actions – Transformations create RDDs from each other, but when we want to work with the actual dataset, an action is performed. An action takes an RDD, performs some operation on it and returns a result as a single value instead of a new RDD –
    val res = nrdd.reduce(_ + _)

So here reduce is an action which adds all the elements of nrdd. You can read more about actions by clicking HERE.
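
To see how transformations and actions fit together, here is a small sketch showing that transformations are lazy and nothing actually runs until an action is called (the numbers are just illustrative) –

val nums = sc.parallelize(Array(1, 2, 3, 4, 5))
val doubled = nums.map(_ * 2)          // transformation: nothing executes yet
val evens = doubled.filter(_ % 4 == 0) // transformation: still lazy
val result = evens.collect()           // action: the whole chain runs now
// result: Array(4, 8)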

DataFrame(DF) – 

DataFrame is an abstraction which gives a schema view of data, which means it gives us a view of the data as columns with column names and type information. We can think of data in a DataFrame like a table in a database.

Like RDD, execution in DataFrame is also lazily triggered. Let's see an example of creating a DataFrame –

case class Person(name: String, age: Int)

import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder()
  .appName("Spark SQL basic example")
  .config("spark.some.config.option", "some-value")
  .getOrCreate()

Here we created the Spark session object, which helps us to create a DataFrame.

// For implicit conversions like converting RDDs to DataFrames

import spark.implicits._

Now creating DataFrame –

val df = List(Person("shubham",21),Person("rahul",23)).toDF

Here df is a DataFrame, and now we can apply different DataFrame operations on it, like –

df.show() // tabular view of the data
// +-------+---+
// |   name|age|
// +-------+---+
// |shubham| 21|
// |  rahul| 23|
// +-------+---+
df.filter("age > 21").show() 

// +-----+---+
// | name|age|
// +-----+---+
// |rahul| 23|
// +-----+---+
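
Since we imported spark.implicits._, we can also convert an RDD into a DataFrame with the same toDF method. Here is a quick sketch using the Person case class defined above –

val personRDD = spark.sparkContext.parallelize(Seq(Person("shubham", 21), Person("rahul", 23)))
val dfFromRDD = personRDD.toDF() // implicit conversion from RDD[Person] to DataFrame
dfFromRDD.printSchema()
// root
//  |-- name: string (nullable = true)
//  |-- age: integer (nullable = false)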

But there are some limitations of DataFrame –

DataFrame limitations:-
Compile-time type safety: The DataFrame API does not provide compile-time type safety, which limits you when manipulating data whose structure is not known. The following example compiles fine; however, you will get a runtime exception when executing this code (assuming the users table has only name and age columns).

val dataframe = spark.sql("select * from users where salary > 10000").show()
// throws a runtime AnalysisException: cannot resolve 'salary' given input columns: age, name

So this is a weak point of DataFrame; DataSet can handle this type of situation.
There are many functions defined for playing with DataFrames. You can read more about DataFrames by clicking HERE.

DataSet(DS):-

  • It is an extension of the DataFrame API, the latest abstraction, which tries to provide the best of both RDD and DataFrame.
  • The Dataset API provides compile-time safety, which was not available in DataFrames.
  • A case class is used to define the structure of the data schema in a Dataset. Using a case class, it is very easy to work with a Dataset. The names of the attributes in the case class are directly mapped to attributes in the Dataset.
  • We can always convert a DataFrame at any point in time into a Dataset by calling the ‘as’ method on the DataFrame, e.g. df.as[MyClass] (a short sketch of this follows the example below).
case class Users(name: String, age: Int)

val ds = List(Users("shubham", 21), Users("rahul", 23)).toDS
ds.filter(_.age > 21).show()
//  +-----+---+
//  | name|age|
//  +-----+---+
//  |rahul| 23|
//  +-----+---+
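
And, as mentioned in the list above, converting the earlier DataFrame into a typed Dataset is just a call to as. A quick sketch using the df and Person case class from before –

val personDS = df.as[Person] // DataFrame -> Dataset[Person]
personDS.filter(_.age > 21).show()
// +-----+---+
// | name|age|
// +-----+---+
// |rahul| 23|
// +-----+---+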

But if we try to access salary, which is not part of the data, then it gives a compile-time error –

ds.filter(_.salary > 21).show() // will give a compile-time error: value salary is not a member of Users
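
On the other hand, operations on fields that do exist are fully type-checked and work with plain Scala lambdas. A small sketch (output shown for illustration) –

ds.map(_.name.toUpperCase).show() // compiles fine: name is a known String field
// +-------+
// |  value|
// +-------+
// |SHUBHAM|
// |  RAHUL|
// +-------+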

So this is all about DataSet. You can read more about DataSet by clicking HERE.

Hope this blog helps you to understand the difference between RDD, DF, and DS. If you still have any doubts, you can ask me by commenting here.

References:

  1. https://spark.apache.org/documentation.html
  2. https://spark.apache.org/docs/2.1.0/sql-programming-guide.html
