The evolution of the three APIs is outlined below:
RDD (Spark 1.0) —> DataFrame (Spark 1.3) —> Dataset (Spark 1.6)
Let us begin with the Resilient Distributed Dataset (RDD).
The crux of Spark lies in the RDD. It is an immutable, distributed collection of elements, partitioned across the nodes of the cluster, that can be operated on in parallel through a low-level API of transformations and actions. RDDs are a natural fit:
– For unstructured data such as streams
– When data manipulation involves functional programming constructs
– When data access and processing are free of schema impositions
– When low-level transformations and actions are required
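The classic word count illustrates these low-level transformations and actions. The snippet below is a minimal sketch and assumes a running SparkSession named `spark` (as you would have in spark-shell):

```scala
// Low-level word count using only RDD transformations and actions.
// Assumes a SparkSession `spark` is already available (e.g. spark-shell).
val lines = spark.sparkContext.parallelize(Seq("a b", "b c", "a c"))

val counts = lines
  .flatMap(_.split(" "))   // transformation: split each line into words
  .map(word => (word, 1))  // transformation: pair each word with a count of 1
  .reduceByKey(_ + _)      // transformation: sum the counts per word

counts.collect().foreach(println) // action: triggers the actual computation
```

Note that nothing here mentions a schema or named columns: the developer works directly with JVM objects and functions.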
Salient Features of RDDs
It can easily and efficiently process structured as well as unstructured data.
It is available in several programming languages, such as Java, Scala, Python and R.
– Distributed collection:
It is based on MapReduce-style operations, which are widely popular for processing and generating large datasets in parallel using a distributed algorithm on a cluster. It lets us write parallel computations with high-level operators, without the overhead of managing work distribution and fault tolerance ourselves.
RDDs are collections of partitioned records. A partition is the primitive unit of parallelism in an RDD; each partition is a logical, immutable division of the data, generated through transformations on existing partitions.
– Fault tolerant:
If part of an RDD is lost, Spark can redo the transformations on that partition to reproduce the same results, rather than replicating the data across multiple nodes.
– Lazy evaluation:
All transformations are lazy: they do not compute their results right away. Instead, Spark merely remembers the transformations applied to the dataset; they are performed only when an action requires a result to be returned to the caller program.
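A short sketch of this laziness, assuming a SparkContext `sc` (i.e. `spark.sparkContext` in spark-shell):

```scala
// Assumes a SparkContext `sc` (spark.sparkContext in spark-shell).
val numbers = sc.parallelize(1 to 1000)

// Nothing is computed here: Spark only records the transformation
// in the RDD's lineage.
val squares = numbers.map(n => n * n)

// The action finally triggers evaluation of the whole lineage.
val firstFive = squares.take(5) // Array(1, 4, 9, 16, 25)
```

Because only `take(5)` is an action, Spark never needs to square all 1000 numbers; laziness lets it compute just what the action demands.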
Drawbacks of RDDs
No inbuilt optimization engine: when working with structured data, RDDs cannot take advantage of Spark’s advanced optimizers (the Catalyst optimizer and the Tungsten execution engine).
Developers instead need to optimize each RDD by hand, based on its characteristics.
Also, unlike DataFrames and Datasets, RDDs do not infer the schema of the ingested data; the user is required to specify it explicitly.
Let us move a step ahead and discuss DataFrames and Datasets.
DataFrames are immutable distributed collections of data in which the data is organised relationally, that is, into named columns, drawing a parallel to tables in a relational database. The essence of DataFrames is to superimpose a structure on a distributed collection of data in order to allow easier and more efficient processing. A DataFrame is conceptually equivalent to a table in a relational database. Along with the DataFrame API, Spark also uses the Catalyst optimizer.
– It is conceptually equivalent to a table in a relational database, but comes with richer optimizations.
– It can process structured and semi-structured data in many formats (Avro, CSV, Elasticsearch, Cassandra) and storage systems (HDFS, Hive tables, MySQL).
– It supports both SQL queries and the DataFrame API.
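The two query styles above can be sketched side by side. This assumes a SparkSession `spark` and a hypothetical `people.json` file with one `{"name": ..., "age": ...}` object per line:

```scala
// Assumes a SparkSession `spark`; people.json is a hypothetical input file.
val people = spark.read.json("people.json")

// DataFrame API: relational operations over named columns.
people.select("name", "age").where("age > 21").show()

// The same query expressed as SQL over a temporary view:
people.createOrReplaceTempView("people")
spark.sql("SELECT name, age FROM people WHERE age > 21").show()
```

Both versions go through the same Catalyst optimizer, so they produce equivalent execution plans.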
Drawbacks of DataFrames
The DataFrame API does not support compile-time safety, which limits the user when manipulating data whose structure is not known.
Also, after transforming a domain object into a DataFrame, the user cannot regenerate the original object from it.
The Dataset API combines two distinct characteristics: a strongly typed and an untyped API. A DataFrame can be seen as a Dataset of the generic type, Dataset[Row], where Row is a generic, untyped JVM object. Datasets, by contrast, are by default collections of strongly typed JVM objects: in Java they are mapped by a class, and in Scala they are governed by a case class.
Datasets provide both static typing and runtime type safety. In layman’s terms, Datasets allow us to catch both syntax and analysis errors at compile time, whereas DataFrames catch only syntax errors there. Another advantage is that DataFrames render a structured view over semi-structured data as a collection of Dataset[Row].
At the core of the Dataset API is the encoder, responsible for the conversion between JVM objects and Spark’s internal tabular representation. This representation is stored in the Tungsten binary format, which improves memory utilisation.
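A minimal sketch of a typed Dataset and its encoder; `Person` is a made-up case class, and a SparkSession `spark` is assumed:

```scala
// Assumes a SparkSession `spark`; Person is a made-up case class
// for which Spark derives an encoder automatically.
case class Person(name: String, age: Long)

import spark.implicits._ // brings case-class encoders into scope

val ds = Seq(Person("Ada", 36), Person("Alan", 41)).toDS()

// Lambdas operate on typed JVM objects, checked at compile time.
ds.filter(_.age > 40).map(_.name).show()

// A DataFrame is simply an untyped Dataset[Row]:
val df = ds.toDF()
```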
Salient Features:
It combines the best of both the APIs above, such as:
– Functional Programming
– Type Safety
– Query Optimisation
Drawbacks of Datasets
It still requires type casting via Strings: querying currently requires specifying the column name as a String, and the column must later be cast to the required data type.
Let us now compare type safety across SQL, DataFrames and Datasets:

| | SQL | DataFrames | Datasets |
| --- | --- | --- | --- |
| Syntax Error | Runtime | Compile Time | Compile Time |
| Analysis Error | Runtime | Runtime | Compile Time |
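The difference between the DataFrame and Dataset rows can be sketched concretely. This assumes a SparkSession `spark`; `Person` is a made-up case class:

```scala
// Assumes a SparkSession `spark`; Person is a made-up case class.
case class Person(name: String, age: Long)
import spark.implicits._
val ds = Seq(Person("Ada", 36)).toDS()

// DataFrame: a typo in a column name still compiles, and only fails
// at runtime with an AnalysisException once the query is analysed.
// ds.toDF().filter("agee > 21")

// Dataset: the same typo in a typed lambda is rejected at compile time.
// ds.filter(_.agee > 21) // error: value agee is not a member of Person
```

The failing lines are left commented out, since the second one would prevent the file from compiling at all, which is precisely the point.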
Let us go a step ahead and discuss performance and optimisation.
– The DataFrame and Dataset APIs use Catalyst to generate optimised logical and physical query plans, regardless of whether you work in Java, Scala or Python.
– The typed Dataset[T] API is optimised for engineering tasks, while the untyped DataFrame API is faster and well suited to interactive analysis.
The encoders in the Dataset API efficiently serialize and deserialize JVM objects and generate compact bytecode; smaller bytecode ensures faster execution speeds.
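You can inspect the plans Catalyst produces for yourself. The sketch below assumes a SparkSession `spark`:

```scala
// Assumes a SparkSession `spark`.
import spark.implicits._
val df = Seq((1, "a"), (2, "b")).toDF("id", "label")

// explain(true) prints the parsed and analysed logical plans, the
// plan after Catalyst's optimisations, and the chosen physical plan.
df.filter($"id" > 1).explain(true)
```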
Having discussed all the important aspects of the Spark APIs, this blog would remain incomplete if I did not discuss the use case of each of them against the others.
When to use DataFrames or Datasets:
- Rich Semantics
- High level abstractions
- Domain specific APIs
- Processing of high-level operations: filters, maps, etc.
- Columnar access and lambda functions on semi-structured data
When to use Datasets:
- A high degree of type safety at compile time
- Take advantage of typed JVM objects
- Take advantage of Catalyst Optimizer
- Save space
- Faster execution
When to use RDDs:
- You need low-level functionality and transformations
- You want tight control over your dataset
Happy Reading !!