Reading data from different sources using Spark 2.1


Hi all,

In this blog, we'll discuss fetching data from different sources such as CSV, JSON, text, and Parquet files.

So, first of all, let's discuss what's new here. In versions of Spark before 2.0, you had to create a SparkConf and a SparkContext to interact with Spark, whereas from Spark 2.0 onwards (and hence in Spark 2.1) the same effect can be achieved through SparkSession, without explicitly creating the SparkConf, SparkContext, or SQLContext, as they are all encapsulated within the SparkSession.

SparkSession provides a single point of entry to interact with underlying Spark functionality and allows programming Spark with the DataFrame and Dataset APIs.

To create an sbt Spark 2.1 project, you need to add the following dependencies to your build.sbt file:

libraryDependencies += "org.apache.spark" % "spark-core_2.11" % "2.1.0"
libraryDependencies += "org.apache.spark" % "spark-sql_2.11" % "2.1.0"

and your Scala version should be 2.11.x. For this project I am using:

scalaVersion := "2.11.8"
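Putting this together, a minimal build.sbt could look like the following (the project name and version are just placeholders):

name := "ReadingDataUsingSpark" // hypothetical project name
version := "1.0"
scalaVersion := "2.11.8"

libraryDependencies += "org.apache.spark" % "spark-core_2.11" % "2.1.0"
libraryDependencies += "org.apache.spark" % "spark-sql_2.11" % "2.1.0"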

To initialise the Spark session, you can use:

val spark = SparkSession
  .builder()
  .master("local")
  .appName("ReadDataFromCsv")
  .getOrCreate()
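Since the session encapsulates the older entry points, you can still reach them through it whenever an API requires them. A quick sketch:

val sc = spark.sparkContext   // the underlying SparkContext, e.g. for RDD code
val sqlCtx = spark.sqlContext // the underlying SQLContext
println(sc.appName)           // prints "ReadDataFromCsv"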

Reading data from a CSV file

val spark = SparkSession
  .builder()
  .master("local")
  .appName("ReadDataFromCsv")
  .getOrCreate()

val df = spark.read.csv("./src/main/resources/testData.csv")
// To display dataframe data
df.show()
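By default the CSV reader treats every column as a string and names the columns _c0, _c1, and so on. If your file has a header row, you can tell Spark to use it and to infer column types (header and inferSchema are standard options of the CSV source; that testData.csv actually has a header is an assumption here):

val dfWithHeader = spark.read
  .option("header", "true")      // take column names from the first line
  .option("inferSchema", "true") // infer column types from the data
  .csv("./src/main/resources/testData.csv")
dfWithHeader.printSchema()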

Reading data from a JSON file

val spark = SparkSession
  .builder()
  .master("local")
  .appName("ReadDataFromJson")
  .getOrCreate()

val df = spark.read.json("./src/main/resources/testJson.json") 
// To display dataframe data
df.show()
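Keep in mind that in Spark 2.1 the JSON source expects one JSON object per line (the JSON Lines format), not a single pretty-printed document. The schema is inferred automatically, so you can inspect it and select fields by name (the field name below is purely illustrative):

df.printSchema() // shows the fields Spark inferred from the JSON
// Assuming testJson.json contains a field called "name":
df.select("name").show()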

Reading data from a text file

val spark = SparkSession
  .builder()
  .master("local")
  .appName("ReadDataFromTextFile")
  .getOrCreate()

val df = spark.read.text("./src/main/resources/textFile.txt") 
// To display dataframe data
df.show()
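spark.read.text returns a DataFrame with a single string column named value, holding one row per line. To do anything useful you usually transform that column, for example by splitting each line into words (a minimal sketch):

import org.apache.spark.sql.functions.{col, explode, split}

// One row per word, derived from the single "value" column
val words = df.select(explode(split(col("value"), " ")).as("word"))
words.show()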

Reading data from a Parquet file

val spark = SparkSession
  .builder()
  .master("local")
  .appName("ReadDataFromParquet")
  .getOrCreate()

val df = spark.read.parquet("./src/main/resources/testData.parquet")
// To display dataframe data
df.show()
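If you don't have a Parquet file at hand, you can create one by writing out any DataFrame you already have; Spark produces a directory of part files at the given path (the path below is illustrative):

// Write the DataFrame back out in Parquet format
df.write.parquet("./src/main/resources/output.parquet")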

You can also create a temporary view from a DataFrame and run SQL queries against it using the following code:

// registerTempTable is deprecated since Spark 2.0; use createOrReplaceTempView instead
df.createOrReplaceTempView("tempTable")
spark.sql("select * from tempTable").show()
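Any SQL that Spark supports can be run against the view, for example a filtered query (the column name here is hypothetical):

// Assuming the data has an "age" column:
spark.sql("select * from tempTable where age > 21").show()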

Here is the link to a demo project on Spark 2.1: ReadingDataUsingSpark2.1

Happy coding!!
