The Dominant APIs of Spark: Datasets, DataFrames and RDDs


While working with Spark, we often come across three APIs: DataFrames, Datasets and RDDs. In this blog I will discuss the three in terms of use cases, performance and optimization. It is essential to keep in mind that seamless conversion is available between DataFrames, Datasets and RDDs. Under the hood, the RDD forms the foundation of both DataFrames and Datasets.

The three were introduced in the following order:

RDD (Spark 1.0) -> DataFrame (Spark 1.3) -> Dataset (Spark 1.6)

Let us begin with the Resilient Distributed Dataset (RDD).

RDD

The crux of Spark lies in the RDD. It is an immutable, distributed collection of elements partitioned across the nodes of the cluster that can be operated on in parallel through a low-level API of transformations and actions.

Use cases

- Working with unstructured data, such as streams of text
- Data manipulation that involves constructs of functional programming
- Data access and processing that is free of schema impositions
- Requiring low-level transformations and actions (see the sketch after this list)
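Here is a minimal sketch of the kind of low-level, functional-style processing RDDs suit well. The local session and the `data.txt` input file are just assumptions for illustration:

```scala
import org.apache.spark.sql.SparkSession

// A local session used for all the sketches in this post.
val spark = SparkSession.builder()
  .appName("spark-api-sketches")
  .master("local[*]")
  .getOrCreate()
val sc = spark.sparkContext

// Load unstructured text as an RDD[String]; the file path is just an example.
val lines = sc.textFile("data.txt")

// Low-level, functional transformations: no schema is imposed on the data.
val wordCounts = lines
  .flatMap(_.split("\\s+"))
  .filter(_.nonEmpty)
  .map(word => (word, 1))
  .reduceByKey(_ + _)

wordCounts.take(10).foreach(println) // an action triggers the computation
```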

Salient Features of RDDs

- Versatile:

It can easily and efficiently process structured as well as unstructured data.

The RDD API is available in several programming languages, such as Java, Scala, Python and R.

- Distributed collection:

It is based on MapReduce-style operations, which are widely used for processing and generating large datasets in parallel using a distributed algorithm on a cluster. RDDs let us write parallel computations with high-level operators, without the overhead of work distribution and fault tolerance.

- Immutable:

RDDs are collections of records that are partitioned. A partition is the primitive unit of parallelism in an RDD, and every partition is a logical division of data which is immutable and created through transformations on existing partitions.

- Fault tolerant:

If an RDD partition is lost, the transformations that produced it can be re-run on its lineage to recompute the same result, rather than replicating the data across multiple nodes.

- Lazy evaluation:

All transformations are lazy: they do not compute their results right away. Instead, Spark remembers the transformations applied to the dataset and performs them only when an action requires a result to be returned to the driver program.
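A quick illustration of this laziness, reusing the `lines` RDD from the earlier sketch:

```scala
// Nothing is computed here: `map` and `filter` only record the lineage.
val upper  = lines.map(_.toUpperCase)
val errors = upper.filter(_.contains("ERROR"))

// The computation runs only when an action such as `count` is invoked.
val numErrors = errors.count()
```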

Drawbacks of RDDs

No inbuilt optimization engine: when working with structured data, RDDs cannot take advantage of Spark's advanced optimizers (the Catalyst optimizer and the Tungsten execution engine); developers need to optimize each RDD by hand, based on its characteristics.

Also, unlike DataFrames and Datasets, RDDs do not infer the schema of the ingested data, so the user is required to specify it explicitly.
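For example, to put a relational structure on an RDD of rows, the schema has to be spelt out by hand. This is a sketch reusing the `sc` and `spark` values from the first example, with made-up records:

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// An RDD of generic Rows: Spark cannot infer the column names or types.
val rowRdd = sc.parallelize(Seq(Row("Asha", 29), Row("Ravi", 34)))

// The schema is not inferred from the RDD; the user must specify it explicitly.
val schema = StructType(Seq(
  StructField("name", StringType, nullable = false),
  StructField("age", IntegerType, nullable = false)
))

val peopleFromRdd = spark.createDataFrame(rowRdd, schema)
peopleFromRdd.printSchema()
```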

Let us move a step ahead and discuss DataFrames and Datasets.

DataFrames

DataFrames are immutable distributed collections of data in which the data is organised in a relational manner, that is, into named columns, drawing a parallel to tables in a relational database. The essence of a DataFrame is to superimpose a structure on a distributed collection of data in order to allow easier and more efficient processing. It is conceptually equivalent to a table in a relational database. Along with the DataFrame API, Spark uses the Catalyst optimizer.

Salient Features:

- It is conceptually equivalent to a table in a relational database, but with richer optimizations.
- It can process structured and semi-structured data formats (Avro, CSV, Elasticsearch, Cassandra) and storage systems (HDFS, Hive tables, MySQL).
- It supports both SQL queries and the DataFrame API (see the sketch after this list).
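Here is a minimal sketch of both styles on the same data, reusing the `spark` session from the first example; the `people.json` file is a made-up input:

```scala
// Read semi-structured JSON into a DataFrame; the schema is inferred.
val people = spark.read.json("people.json")

// DataFrame API: named-column operations optimised by Catalyst.
people.filter(people("age") > 21).groupBy("city").count().show()

// The same data can also be queried with SQL.
people.createOrReplaceTempView("people")
spark.sql("SELECT city, COUNT(*) AS n FROM people WHERE age > 21 GROUP BY city").show()
```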

Drawbacks of DataFrames

The DataFrame API does not offer compile-time type safety, which limits the user when manipulating data whose structure is not known.

Also, once a domain object has been transformed into a DataFrame, the original object cannot be regenerated from it.

Datasets

The Dataset takes on two distinct API characteristics: a strongly typed API and an untyped one. A DataFrame can be seen as a Dataset of the generic type Row, i.e. Dataset[Row], where Row is a generic, untyped JVM object. Unlike DataFrames, Datasets are by default collections of strongly typed JVM objects: in Java they are mapped by a class, and in Scala they are governed by a case class.

Datasets provide both static typing and runtime type safety. In layman's terms, Datasets allow us to catch analysis errors at compile time that DataFrames and SQL would surface only at runtime. Another advantage is that DataFrames render a structured view over semi-structured data as a collection of Dataset[Row].
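A small sketch of this difference, reusing the `spark` session from earlier; the `Person` case class and its values are made up for illustration:

```scala
import spark.implicits._

case class Person(name: String, age: Int)

// A strongly typed Dataset[Person] backed by case-class objects.
val ds = Seq(Person("Asha", 29), Person("Ravi", 34)).toDS()

// Typed API: a misspelled field such as `_.agee` would not compile.
ds.filter(_.age > 30).show()

// Untyped DataFrame API: a misspelled column name compiles fine
// and fails only at runtime, during analysis.
// ds.toDF().filter($"agee" > 30).show()
```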

At the core of the Dataset API is the encoder, which is responsible for the conversion between JVM objects and Spark's internal tabular representation. This representation is stored in the Tungsten binary format, which improves memory utilisation.
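As a sketch of what this looks like, an encoder for the `Person` case class above can also be obtained explicitly:

```scala
import org.apache.spark.sql.{Encoder, Encoders}

// A product encoder for the case class; it carries the tabular schema
// used by the Tungsten binary representation.
val personEncoder: Encoder[Person] = Encoders.product[Person]
println(personEncoder.schema)

val typedDs = spark.createDataset(Seq(Person("Mina", 41)))(personEncoder)
```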

Salient Features:

The Dataset API offers the best of both of the APIs above:

- Functional programming
- Type safety
- Query optimisation
- Encoding

Drawbacks of Datasets

Columns still have to be referred to by name as Strings: querying currently requires specifying the column as a String and then casting it to the required data type.
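For instance, selecting a single column from the typed `ds` Dataset above still goes through a string column name and an explicit cast:

```scala
// The column is referred to by a String name and then cast back to a typed value.
val ages = ds.select($"age".as[Int])
ages.show()
```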

Let us now compare the type safety of the three approaches in tabular form:

Error            SQL        DataFrames      Datasets
Syntax error     Runtime    Compile time    Compile time
Analysis error   Runtime    Runtime         Compile time

Let us go a step ahead and discuss performance and optimisation.

Performance Optimisation:

- The DataFrame and Dataset APIs use the Catalyst optimizer to generate optimised logical and physical query plans, whether you work in Java, Scala or Python (see the sketch after this list).

- The typed Dataset[T] API is optimised for engineering tasks, while the untyped DataFrame API is even faster and well suited for interactive analysis.
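You can inspect the plans Catalyst produces with `explain`; this is a sketch reusing the `people` DataFrame from earlier:

```scala
// Prints the parsed, analysed and optimised logical plans plus the physical plan.
people.filter($"age" > 21).select("name").explain(extended = true)
```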

Space Optimisation:
[Figure: memory usage when caching Datasets vs RDDs]

The encoders in the Dataset API serialize and deserialize JVM objects efficiently and generate compact bytecode; smaller bytecode leads to faster execution.

Having discussed all the important aspects of the Spark APIs, this blog would remain incomplete if I did not discuss when to use each of them.

When to use DataFrames or Datasets:

  • Rich semantics
  • High-level abstractions
  • Domain-specific APIs
  • Processing with high-level operations: filters, maps, etc.
  • Columnar access and lambda functions on semi-structured data

When to use Datasets:

  • A high degree of type safety at compile time
  • Taking advantage of typed JVM objects
  • Taking advantage of the Catalyst optimizer
  • Saving space
  • Faster execution

When to use RDDs:

  • Low-level functionality, transformations and actions
  • Tight control over the data

Happy Reading !!


