Why Dataset Over DataFrame?

In this post we will look at the real advantage the Dataset API in Spark 2 has over the DataFrame API.

A DataFrame is weakly typed, so developers do not get the benefits of the type system. That is why Spark 2 puts the Dataset API (first previewed in Spark 1.6) front and center. To see the difference, consider the following scenario.

Suppose you want to read a CSV file in a structured way:

scala> val dataframe = spark.read.format("com.databricks.spark.csv").option("header", "true").option("inferSchema", "true").load("file:///home/hduser/Documents/emp.csv")
dataframe: org.apache.spark.sql.DataFrame = [ID: int, NAME: string ... 1 more field]

scala> dataframe.select("name").where("ids>1").collect
org.apache.spark.sql.AnalysisException: cannot resolve '`ids`' given input columns: [name]; line 1 pos 0;
'Filter ('ids > 1)
+- Project [name#1]
   +- Relation[ID#0,NAME#1,ADDRESS#2] csv

  at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
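The failure above only appears once Spark analyzes the query. Here is a minimal sketch of the same situation, assuming a few made-up in-memory rows in place of emp.csv and a local SparkSession (in spark-shell, `spark` already exists). It shows that the misspelled column compiles fine and is only rejected when the code runs:

```scala
import org.apache.spark.sql.{AnalysisException, SparkSession}
import scala.util.Try

// Local session for this sketch only; spark-shell provides `spark` for you.
val spark = SparkSession.builder().master("local[*]").appName("df-runtime-error").getOrCreate()
import spark.implicits._

// Hypothetical stand-in for emp.csv: the rows are invented for illustration.
val dataframe = Seq((1, "alice", "pune"), (2, "bob", "delhi")).toDF("ID", "NAME", "ADDRESS")

// This compiles: column names are plain strings, which the compiler never checks.
val result = Try(dataframe.select("name").where("ids > 1").collect())

// The bad column name is reported only when Spark analyzes the plan at run time.
val failedAtRuntime = result.failed.toOption.exists(_.isInstanceOf[AnalysisException])
println(s"failed at runtime: $failedAtRuntime")
```

Wrapping the call in `Try` is only there so the sketch can inspect the exception instead of crashing.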

So instead of a compilation error, you get a runtime error. With the Dataset API, a wrong field name in a typed operation is caught at compile time. First, read the same file as a Dataset:

scala> val dataset = spark.read.format("com.databricks.spark.csv").option("header", "true").option("inferSchema", "true").load("file:///home/hduser/Documents/emp.csv").as[Emp]

dataset: org.apache.spark.sql.Dataset[Emp] = [ID: int, NAME: string ... 1 more field]

The Dataset is typed because it operates on domain objects, here the Emp class (presumably a case class whose fields match the CSV columns).

So we can be type-safe here, because the element type of this Dataset is Emp.

If we try to map over a field that does not exist, we get a compilation error:

scala> dataset.filter("id>0").map(_.name1)
<console>:28: error: value name1 is not a member of Emp

So we can say that a Dataset is a DataFrame with type safety. In Spark 2, DataFrame is simply an alias for Dataset[Row], whereas a Dataset[Emp] operates on domain objects and lets the compiler check every field access.
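The typed side of the comparison can be sketched in the same way. The post never shows the definition of Emp, so the case class below is an assumption: its field names mirror the columns ID, NAME and ADDRESS, and the field types (in particular String for ADDRESS) are guesses. The rows are also made up, standing in for emp.csv.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Assumed definition: fields mirror the CSV columns; the types are guesses.
case class Emp(ID: Int, NAME: String, ADDRESS: String)

// Local session for this sketch only; spark-shell provides `spark` for you.
val spark = SparkSession.builder().master("local[*]").appName("typed-dataset").getOrCreate()
import spark.implicits._

// Hypothetical rows standing in for emp.csv.
val dataset: Dataset[Emp] = Seq(Emp(1, "alice", "pune"), Emp(2, "bob", "delhi")).toDS()

// Typed lambdas: the compiler checks every field access on Emp.
val names = dataset.filter(_.ID > 1).map(_.NAME).collect().toList

// dataset.map(_.NAME1)  // would not compile: value NAME1 is not a member of Emp

println(names)
```

Note that only the lambda-based operations are checked by the compiler. A string expression such as `filter("id>0")` is still resolved at run time, even on a Dataset, so a typo inside it would fail the same way it does on a DataFrame.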


