In this blog I try to cover the difference between RDD, DF and DS. much of you have a little bit confused about RDD, DF and DS. so don’t worry after this blog everything will be clear.
With Spark2.0 release, there are 3 types of data abstractions which Spark officially provides now to use: RDD, DataFrame and DataSet.
so let’s start some discussion about it.
Resilient Distributed Datasets (RDDs) – Rdd is is a fault-tolerant collection of elements that can be operated on in parallel.
By the rdd, we can perform operations on data on the different nodes of the same cluster parallelly so it’s helpful in increasing the performance.
How we can create the RDD
Spark context(sc) helps to create the rdd in the spark. it can create the rdd from –
- external storage system like HDFS, HBase, or any data source offering a Hadoop InputFormat.
- parallelizing an existing collection in your driver program.
Let’s see the example for creating rdd of both types –
Creating rdd from parallelizing an existing collection –
val data = Array(1, 2, 3, 4, 5) val distData = sc.parallelize(data)
Here we use parallelize method of sc for creating rdd . now we can perform any operation on it. Basically, there are two type of operations defined –
- Transformation – In transformation takes an rdd perform some operations on it and return new rdd . for example –
val nrdd = distData.map(_ + 1)
The map function iterates over every line in RDD and split into new RDDThere is a lot of transformation functions define you can read about it by click HERE .
- Actions – Transformations create RDDs from each other, but when we want to work with the actual dataset, at that point action is performed. Actions take an rdd , perform some operation on it and return a result as a single value instead of new rdd –
val res = nrdd.reduce(_ + _)
so here reduce is an action which add all elements of nrdd . you can read more on action by clicking HERE
DataFrame is an abstraction which gives a schema view of data. Which means it gives us a view of data as columns with column name and types info, We can think data in data frame like a table in the database.
Like RDD, execution in Dataframe too is lazy triggered. let’s see an example for creating DataFrame –
case class Person(name : String , age:Int)
import org.apache.spark.sql.SparkSession val spark = SparkSession .builder() .appName("Spark SQL basic example") .config("spark.some.config.option", "some-value") .getOrCreate()
Here we created the spark session object . that help us to create DataFrame.
// For implicit conversions like converting RDDs to DataFrames
Now creating DataFrame –
val df = List(Person("shubham",21),Person("rahul",23)).toDF
Here df is a DataFrame and now we can apply different operations of DataFrame on it like –
df.show() //schema view of data
// +-------+---+ // | name |age| // +-------+---+ // |shubham| 21| // | rahul | 23| // +-------+---+
df.filter("age > 21").show() // +-----+---+ // | name|age| // +-----+---+ // |rahul| 23| // +-----+---+
But there are some limitations in DataFrame –
Compile-time type safety: Dataframe API does not support compile time safety which limits you from manipulating data when the structure is not known. The following example works during compile time. However, you will get a Runtime exception when executing this code.
val dataframe = spark.sql("select * from users where salary > 10000").show()
throws Exception : cannot resolve 'salary' given input age , name
so this is a weak point of DataFrame. DataSet can handle this type of situation.
There are many functions defined for play with data frame . you can read more about data frame by click HERE
- It is an extension to Dataframe API, the latest abstraction which tries to provide best of both RDD and Dataframe.
- Datasets API provides compile time safety which was not available in Data frames.
- case class is used to define the structure of data schema in Dataset. Using case class, it’s very easy to work with the dataset. Names of different attributes in case class are directly mapped to attributes in Dataset.
- we can always convert a data frame at any point of time into a dataset by calling ‘as’ method on Dataframe. e.g. df.as[MyClass]
val ds = List(Users("shubham",21),Users("rahul",23)).toDS ds.filter(_.age > 21).show()
// +-----+---+ // | name|age| // +-----+---+ // |rahul| 23| // +-----+---+
But if we try to find out salary which is not part of data then it gives compile time error
ds.filter(_.salary > 21).show() - will give comile time error
so this is all about dataSet you can read about more dataSet by click HERE
Hope this blog will help you to clear the difference between RDD,DF, and DS. if you have any doubt yet then you can ask me by commenting here.