Reading data from different sources using Spark 2.1

Hi all! In this blog we'll discuss how to fetch data from different sources using Spark 2.1, namely CSV, JSON, text and Parquet files.

So, first of all, let's discuss what's new in Spark 2.1. In previous versions of Spark you had to create a SparkConf and a SparkContext to interact with Spark, whereas in Spark 2.1 the same effect can be achieved through a SparkSession, without explicitly creating the SparkConf, SparkContext or SQLContext, as they are all encapsulated within the SparkSession.

SparkSession provides a single point of entry to interact with underlying Spark functionality and allows programming Spark with DataFrame and Dataset APIs.

To create an sbt-based Spark 2.1 project, add the following dependencies to your build.sbt file:

libraryDependencies += "org.apache.spark" % "spark-core_2.11" % "2.1.0"
libraryDependencies += "org.apache.spark" % "spark-sql_2.11" % "2.1.0"
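
Equivalently, you can let sbt pick the Scala binary suffix for you with the %% operator; this resolves to the same _2.11 artifacts:

libraryDependencies += "org.apache.spark" %% "spark-core" % "2.1.0"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.1.0"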

and your Scala version should be 2.11.x. For this project I am using:

scalaVersion := "2.11.8"

To initialise the SparkSession you can use:

import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder()
  .master("local")
  .appName("ReadDataFromCsv")
  .getOrCreate()
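
When you are done with the session, you can shut it down to release its resources:

spark.stop()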

Reading data from a CSV file

val spark = SparkSession
  .builder()
  .master("local")
  .appName("ReadDataFromCsv")
  .getOrCreate()

val df = spark.read.csv("./src/main/resources/testData.csv")

// To display the dataframe data
df.show()
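
By default, spark.read.csv treats every column as a string and every row, including the first, as data. If your file has a header row, you can ask Spark to use it for column names and to infer column types. A minimal sketch, reusing the spark session created above:

// "header" uses the first line for column names,
// "inferSchema" samples the data to guess column types
val csvWithHeader = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("./src/main/resources/testData.csv")

csvWithHeader.printSchema()
csvWithHeader.show()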

Reading data from a JSON file

val spark = SparkSession
  .builder()
  .master("local")
  .appName("ReadDataFromJson")
  .getOrCreate()

val df = spark.read.json("./src/main/resources/testJson.json")

// To display the dataframe data
df.show()
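
Note that in Spark 2.1, spark.read.json expects one JSON object per line (the JSON Lines format). The schema is inferred automatically from the documents, so you can inspect it or select individual fields; the column names below are hypothetical, for illustration only:

// Inspect the schema Spark inferred from the JSON documents
df.printSchema()

// Select individual top-level fields; "name" and "age" are hypothetical
df.select("name", "age").show()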

Reading data from a text file

val spark = SparkSession
  .builder()
  .master("local")
  .appName("ReadDataFromTextFile")
  .getOrCreate()

val df = spark.read.text("./src/main/resources/textFile.txt")

// To display the dataframe data
df.show()
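
spark.read.text returns a DataFrame with a single string column named value, holding one row per line of the file. A small sketch of working with it, here counting the non-empty lines:

import org.apache.spark.sql.functions.{col, length}

// Each line of the file becomes one row in the "value" column
val nonEmptyLines = df.filter(length(col("value")) > 0)
println(s"Non-empty lines: ${nonEmptyLines.count()}")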

Reading data from a Parquet file

val spark = SparkSession
  .builder()
  .master("local")
  .appName("ReadDataFromParquet")
  .getOrCreate()

val df = spark.read.parquet("./src/main/resources/testJson.parquet")

// To display the dataframe data
df.show()
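
Parquet also makes a convenient output format, since it stores the schema alongside the data. As a quick sketch (the output path is hypothetical), you can write any dataframe back out as Parquet and read it again with its schema intact:

// Write the dataframe out in Parquet format; the path is hypothetical
df.write.parquet("./src/main/resources/output.parquet")

// Reading it back restores both the data and the schema
val roundTrip = spark.read.parquet("./src/main/resources/output.parquet")
roundTrip.printSchema()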

You can also create a temporary view from a dataframe and run SQL queries against it. The registerTempTable method still works in Spark 2.1 but is deprecated in favour of createOrReplaceTempView:

df.createOrReplaceTempView("tempTable")
spark.sql("select * from tempTable").show()
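
Any SQL that Spark supports can be run against the view, for example a filtered query (the column name age is hypothetical):

// Filter rows via SQL; "age" is a hypothetical column name
spark.sql("select * from tempTable where age > 30").show()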

Here is the link to a demo project on Spark 2.1: ReadingDataUsingSpark2.1

So, follow these steps and you can fetch data from all of these sources using Spark 2.1.

Happy coding!!

Written by Geetika Gupta

Geetika Gupta is a software consultant having more than 2.5 years of experience. She enjoys coding in languages such as C, C++, Java and Scala, has a good knowledge of big data technologies like Spark, Hadoop, Hive and Presto, and is currently working on Akka HTTP and DynamoDB. Her hobbies include watching television, listening to music and travelling.
