DataFrames and Datasets: Apache Spark’s Developer-Friendly Structured APIs

Reading Time: 4 minutes

This is a two-part blog: in the first part we’ll cover the DataFrame API, and in the second part Datasets.

Spark 2.x introduced structure to Spark programs through two ideas. The first is expressing computations using common patterns found in data analysis, such as filtering, selecting, counting, aggregating, and grouping.

The second is ordering and structuring your data in a tabular format, like an Excel spreadsheet or a SQL table. Together these lead to more expressive and better-optimised code, thanks to high-level DSL operators and APIs.
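To give a feel for what those high-level DSL operators look like, here is a minimal sketch; the tiny dataset and column names are made up purely for illustration:

import org.apache.spark.sql.functions._
import spark.implicits._

// A tiny, made-up DataFrame just to illustrate the DSL style
val matchesDF = Seq(
  ("Virat Kohli", "ODI", 123),
  ("Virat Kohli", "ODI", 89),
  ("Joe Root", "Test", 180)
).toDF("batsmanName", "format", "runsScored")

val topScorers = matchesDF
  .filter(col("format") === "ODI")          // keep only ODI innings
  .groupBy("batsmanName")                   // group by batsman
  .agg(sum("runsScored").as("totalRuns"))   // total runs per batsman
  .orderBy(desc("totalRuns"))               // highest aggregate first

topScorers.show()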

Let’s see this in action by diving into the two structured APIs, DataFrames & Datasets:

The DataFrame API

Spark’s DataFrame is a two-dimensional, table-like data structure inspired by the pandas DataFrame. The data is distributed across the cluster and organised into named columns with a well-defined schema, where each column has a specific data type.

When data has a proper structure, it becomes much easier to perform common operations on its rows and columns. DataFrames are immutable, and Spark keeps track of all transformations, so adding or changing the names and data types of columns produces a new DataFrame rather than modifying the original.
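As a quick, hedged illustration of this immutability (the tiny DataFrame and column names below are made up), renaming a column or casting its type returns a brand-new DataFrame while the original stays untouched:

import org.apache.spark.sql.functions.col
import spark.implicits._

// A tiny, made-up DataFrame for illustration
val scoresDF = Seq(("Virat Kohli", "ODI", 123)).toDF("batsmanName", "format", "runsScored")

// Each of these returns a brand-new DataFrame; scoresDF itself is never modified
val renamedDF = scoresDF.withColumnRenamed("runsScored", "runs")
val typedDF   = renamedDF.withColumn("runs", col("runs").cast("long"))

scoresDF.printSchema()  // original schema, unchanged
typedDF.printSchema()   // new schema with the renamed, re-typed column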

With that said, let’s see how we can create schemas and dataframes.

Schemas & DataFrames

In Spark, a schema defines the column names and data types associated with a DataFrame. It can be declared explicitly or inferred implicitly. There are two ways to define a schema explicitly in Spark:

The first way is to define the schema programmatically. Suppose you have a DataFrame with three columns named “batsmanName”, “format”, and “runsScored”. Using the Spark DataFrame API:

import org.apache.spark.sql.types._

val aSchema = StructType(Array(StructField("batsmanName", StringType, true),
                              StructField("format", StringType, true),
                              StructField("runsScored", IntegerType, true)))

The second, much easier way to create the schema is by using a DDL (Data Definition Language) string:

val schema = "batsmanName String, format String, runsScored Int"

Let’s take a complete example to illustrate how you can use the DataFrame API with data in JSON format.

import org.apache.spark.sql.types._

val aSchema = StructType(Array(StructField("Year", IntegerType, false),
                               StructField("First Name", StringType, false),
                               StructField("Country", StringType, false),
                               StructField("Sex", StringType, false),
                               StructField("Count", IntegerType, false)))

val aDF = spark.read
  .schema(aSchema)
  .option("multiline", true)
  .json("dbfs:/user/datasets/Baby_Names.json")
aDF.show(false)
aDF.printSchema()

The output of the above code: (Image 1 below)

Useful Operations on DataFrames

The first thing you need in order to perform operations on a DataFrame is the data itself. Spark provides interfaces to load data into a DataFrame from data sources in formats like JSON, CSV, Parquet, etc.

DataFrameReader and DataFrameWriter

You can easily perform reading and writing operations in Spark, as it provides high-level abstractions and community-contributed connectors for a wide variety of data sources, like NoSQL stores, RDBMSs, and more.

Let’s take an example to get a better understanding of these operations:

val sampleDF = spark
  .read
  .option("samplingRatio", 0.001)
  .option("header", true)
  .csv("""dbfs:/FileStore/machine_readable_business_employment_data_mar_2022_quarter.csv""")

sampleDF.show()

The output of the above code: (Image 2 below)

The spark.read.csv() function reads in the CSV file and returns a DataFrame of rows with named columns taken from the header row. Note that no explicit schema was supplied here, so every column gets the default string type; to get typed columns you would either pass a schema or enable schema inference.
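If you prefer not to rely on inference, you can hand the reader an explicit DDL schema instead. This is only a sketch: it pretends, purely for illustration, that the file contains just these three columns, whereas a real schema would have to list every column in the file:

// Assumed DDL schema for the employment dataset (illustrative only)
val employmentSchema = "Period String, Data_value Double, Subject String"

val typedSampleDF = spark.read
  .schema(employmentSchema)
  .option("header", true)
  .csv("""dbfs:/FileStore/machine_readable_business_employment_data_mar_2022_quarter.csv""")

typedSampleDF.printSchema()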

Writing a DataFrame to a Parquet File and a SQL Table

Parquet is one of the most popular columnar formats. Spark uses snappy compression by default when writing it, and the schema is preserved whenever a DataFrame is written out as Parquet.
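Although snappy is already the default, the codec can also be requested explicitly when writing. A small sketch, where anyDF and the output path are placeholders rather than part of the example that follows:

// anyDF is a placeholder DataFrame; snappy is Spark's default Parquet codec
anyDF.write
  .option("compression", "snappy")
  .mode("overwrite")
  .parquet("dbfs:/FileStore/some_parquet_output")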

Sometimes, after performing transformations on your DataFrame, you need to store the transformed DataFrame as a SQL table. Let’s see how we can implement this.

Continuing our code from the above example:

val sampleDF = spark
  .read
  .option("samplingRatio", 0.001)
  .option("header", true)
  .csv("""dbfs:/FileStore/machine_readable_business_employment_data_mar_2022_quarter.csv""")

sampleDF.show()

val parquetTable = "dataframe_df.business_employment"
val transformedDf = sampleDF.select("Period", "Data_value", "Subject")
transformedDf.write.format("parquet").saveAsTable(parquetTable)

Verifying our output by querying the dataframe_df.business_employment table: (Image 3 below)
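For reference, a verification query along these lines could be run with Spark SQL; the exact query behind the screenshot is an assumption:

// Querying the saved table to verify the write
spark.sql("SELECT * FROM dataframe_df.business_employment LIMIT 10").show()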

Let’s see how we can save the data into a Parquet file:

// Reading the multiline JSON file into a DataFrame
val someDF = spark.read.option("multiline", "true").json("dbfs:/user/datasets/Baby_Names.json")

someDF.show()

// "First Name" contains a space, which the Parquet writer rejects, so rename it before writing
someDF.withColumnRenamed("First Name", "FirstName")
  .write
  .mode("overwrite")
  .parquet("dbfs:/FileStore/baby_names_parq")

Output of the above code: (Image 4 below)
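As a quick sanity check (this read-back step is my addition, not part of the original example), you could read the Parquet output back and confirm that the data and the renamed column survived the round trip:

// Reading the Parquet output back to confirm the schema was preserved
val parqDF = spark.read.parquet("dbfs:/FileStore/baby_names_parq")

parqDF.printSchema()
parqDF.show()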

Transformations and Actions

Let’s quickly recap what these two terms mean. A transformation produces a new DataFrame from an existing one without altering the original data; operations such as select() and filter() are transformations.

Transformations are evaluated lazily; an action is what triggers the execution of all the recorded transformations, for example show(), count(), or collect(). Let’s see an example:

transformedDf.where($"Data_value"=== 80078).show()

Note: To use Spark SQL functions like when, col, lit, etc., you need the import shown below:

import org.apache.spark.sql.functions._
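As a small, hedged illustration of those functions, here is a derived column built with when, col and lit; the bucketing threshold and column name are made up for this example:

// Adding a made-up derived column using when, col and lit
val labelledDf = transformedDf.withColumn(
  "valueBucket",
  when(col("Data_value").cast("double") > 50000, lit("high")).otherwise(lit("low"))
)

labelledDf.show()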

Conclusion

That was all for the first part of Spark’s structured APIs, in which we covered the basics of the DataFrame API. If you are reading this conclusion, thanks for making it to the end of the blog, and if you like my writing, please check out more of my blogs by clicking here.

Reference

https://docs.databricks.com/getting-started/spark/dataframes.html

Written by 

Hi community, I am Raviyanshu from Dehradun, a tech enthusiast trying to make something useful with the help of 0s & 1s.
