Parsing and Querying CSV: Apache Spark


In this post, I will be sharing How to parse and query  CSV file’s data using Apache Spark. Querying a CSV data is very easy using the Spark – CSV library. For that matter will be using SQLContext, using SQLContext we can query the data like we do in any database language. We can perform all the operation on data like SELECT and also write the data into a new file.

Alright, enough talking now, Come let’s see How can we do it!

We will set up an SBT project first. After setting up an SBT project we will start by adding required dependencies into build.sbt.

For this project, we will require 3 major dependencies.

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.1"
//for spark 1.6 
libraryDependencies += "org.apache.spark" %% "spark-sql" % "1.6.1"
// for spark sqlcontext
libraryDependencies += "com.databricks" % "spark-csv_2.11" % "1.5.0"
// this is the show stealer,this is the dependency 
required to parse and query csv data

Alright, now we are done setting up the project and adding dependencies, Come let’s get started.

We will be writing the code in steps so that It will be easier for you to understand each and every line.

 1.val sparkConf = new SparkConf().setAppName("simpleReading").
setMaster("local[2]") //set spark configuration
2.val sparkContext = new SparkContext(sparkConf) // make spark context
3. val sqlContext = new SQLContext(sparkContext) // make sql context
4.val df = sqlContext.read.
    format("com.databricks.spark.csv").
    option("header","true").
    option("inferSchema","true").
load("employee.csv") //load data from a file

5.val selectedCity = df.select("city")
  selectedCity.write.save("employee.csv") 
//select a particular column which is "city" from the file 
and save the selected data into a  new csv

 6.val selectNameAndAge = df.select("name","age") //select 
particular columns
  selectMake.show() //show the selected columns

7.val tempTable = df.registerTempTable("my_table") 
//makes a temporary table my_table
  val usingSQL = sqlContext.sql("select * from my_table") 
//select all the csv file's data in temp table
usingSQL.show() //show the temporary table using show function 

Final code looks like this

import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}

/**
 * Created by deepak on 7/1/17.
 */
object SparkTest extends App {
  val sparkConf = new SparkConf().setAppName("simpleReading").
setMaster("local[2]")
  //set spark configuration
  val sparkContext = new SparkContext(sparkConf)
  // make spark context
  val sqlContext = new SQLContext(sparkContext) // make sql context

  val df = sqlContext.read.
    format("com.databricks.spark.csv").
    option("header", "true").
    option("inferSchema", "true").load("employee.csv") 
//load data from a file

  val selectedCity = df.select("city")
  selectedCity.write.save("employee.csv")
  //save the data in new csv

  val selectMake = df.select("name", "age") //select particular column
  selectMake.show()
  //show make column

  val tempTable = df.registerTempTable("my_table")
  //makes a temporary table
  val usingSQL = sqlContext
    .sql("select * from my_table") 
//show all the csv file's data in temp table
  usingSQL.show()
}

You can find the mini project with all the code at the link Click here
If you find any challenge, Do let me know in the comments.
If you enjoyed this post, I’d be very grateful if you’d help it spread.Keep smiling, Keep coding!

Advertisements

About deepak028

There is nothing much to describe me.However, I am a very ordinary person who believes in sharing knowledge.
This entry was posted in Scala. Bookmark the permalink.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s