In this blog we will learn what advantage the Dataset API in Spark 2 really has over the DataFrame API.
DataFrame
is weakly typed, so developers don't get the benefits of the type system. That is why the Dataset API was introduced in Spark 2. To understand this, consider the following scenario.
Suppose you want to read records from a CSV file in a structured way:
scala> val dataframe = spark.read.format("com.databricks.spark.csv").option("header", "true").option("inferSchema", "true").load("file:///home/hduser/Documents/emp.csv")
dataframe: org.apache.spark.sql.DataFrame = [ID: int, NAME: string ... 1 more field]

scala> dataframe.select("name").where("ids>1").collect
org.apache.spark.sql.AnalysisException: cannot resolve '`ids`' given input columns: [name]; line 1 pos 0;
'Filter ('ids > 1)
+- Project [name#1]
   +- Relation[ID#0,NAME#1,ADDRESS#2] csv
  at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
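For comparison, here is what the same query looks like with the column names spelled the way they appear in the inferred schema (a minimal sketch, assuming the same emp.csv with its ID, NAME and ADDRESS columns). It runs, but nothing stopped the misspelled version above from reaching execution, because the check only happens when the query is analysed at run time:

scala> dataframe.select("NAME").where("ID > 1").collect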
So instead of a compilation error you get a run-time error. If you use the Dataset API instead, the same kind of mistake is caught at compile time.
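For the Dataset example below, a case class describing an employee row needs to be in scope before calling .as[Emp]. Its definition is not shown in the original snippet; a plausible version, matching the ID, NAME and ADDRESS columns inferred above, would be:

scala> case class Emp(ID: Int, NAME: String, ADDRESS: String)
defined class Emp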
scala> val dataset = spark.read.format("com.databricks.spark.csv").option("header", "true").option("inferSchema", "true").load("file:///home/hduser/Documents/emp.csv").as[Emp]
dataset: org.apache.spark.sql.Dataset[Emp] = [ID: int, NAME: string ... 1 more field]
A Dataset is typed because it operates on domain objects. We get type safety here because the element type of the Dataset is the Emp class, and if we try to map over a field that does not exist, we get a compilation error:
scala> dataset.filter("id>0")map{_.name1}
<console>:28: error: value name1 is not a member of Emp
       dataset.filter("id>0")map{_.name1}
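For contrast, a version that refers to an existing field compiles and runs (a sketch assuming the Emp case class defined above, with NAME as the field name and spark.implicits._ imported, as it is by default in spark-shell):

scala> dataset.filter(_.ID > 0).map(_.NAME).collect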
So we can say that a Dataset is essentially a DataFrame with type safety, because it operates on domain objects, whereas a DataFrame only works with untyped Row objects.
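In fact the relationship is the reverse of what the original phrasing suggests: in Spark 2 a DataFrame is just an alias for Dataset[Row], defined in the org.apache.spark.sql package object:

type DataFrame = Dataset[Row]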