In this post, I will be sharing How to parse and query CSV files data using Apache Spark. Querying a CSV data is very easy using the Spark – CSV library. For that matter will be using SQLContext, using SQLContext we can query the data like we do in any database language. We can perform all the operation on data like SELECT and also write the data into a new file.
Alright, enough talking now, Come let’s see How can we parse and query CSV files!
We will set up an SBT project first. After setting up an SBT project we will start by adding required dependencies into build.sbt.
For this project, we will require 3 major dependencies.
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.1" //for spark 1.6 libraryDependencies += "org.apache.spark" %% "spark-sql" % "1.6.1" // for spark sqlcontext libraryDependencies += "com.databricks" % "spark-csv_2.11" % "1.5.0" // this is the show stealer,this is the dependency required to parse and query csv data
Alright, now we are done setting up the project and adding dependencies, Come let’s get started.
We will be writing the code in steps so that It will be easier for you to understand each and every line.
1.val sparkConf = new SparkConf().setAppName("simpleReading"). setMaster("local[2]") //set spark configuration
2.val sparkContext = new SparkContext(sparkConf) // make spark context
3. val sqlContext = new SQLContext(sparkContext) // make sql context
4.val df = sqlContext.read. format("com.databricks.spark.csv"). option("header","true"). option("inferSchema","true"). load("employee.csv") //load data from a file
5.val selectedCity = df.select("city") selectedCity.write.save("employee.csv") //select a particular column which is "city" from the file and save the selected data into a new csv
6.val selectNameAndAge = df.select("name","age") //select particular columns selectMake.show() //show the selected columns
7.val tempTable = df.registerTempTable("my_table") //makes a temporary table my_table val usingSQL = sqlContext.sql("select * from my_table") //select all the csv file's data in temp table usingSQL.show() //show the temporary table using show function
Final code looks like this
import org.apache.spark.sql.SQLContext import org.apache.spark.{SparkConf, SparkContext} /** * Created by deepak on 7/1/17. */ object SparkTest extends App { val sparkConf = new SparkConf().setAppName("simpleReading"). setMaster("local[2]") //set spark configuration val sparkContext = new SparkContext(sparkConf) // make spark context val sqlContext = new SQLContext(sparkContext) // make sql context val df = sqlContext.read. format("com.databricks.spark.csv"). option("header", "true"). option("inferSchema", "true").load("employee.csv") //load data from a file val selectedCity = df.select("city") selectedCity.write.save("employee.csv") //save the data in new csv val selectMake = df.select("name", "age") //select particular column selectMake.show() //show make column val tempTable = df.registerTempTable("my_table") //makes a temporary table val usingSQL = sqlContext .sql("select * from my_table") //show all the csv file's data in temp table usingSQL.show() }
You can find the mini project with all the code at the link Click here
If you find any challenge, Do let me know in the comments.
If you enjoyed this post, I’d be very grateful if you’d help it spread.Keep smiling, Keep coding!
I am getting below error while compiling , please correct me
error: not found: type SQLContext val sqlContext = new SQLContext(sparkContext) // make sql context ^ 5 errors found
I am not getting results when tried with usingSQL.show() and selectNameAndAge.show() ..it just shows column names no values showing . Please suggest what could be the problem.
I want to parse csv file, where one column itself has commas in it.
Ex: sandeep, male, 32 , {address: hyd ,area: kkp,pin:500}.
Here last column which is in flower braces should be treated a single column. Can you please let me know how to frame schema for this column.
hi Deepak,
I have a CSV file which is in nmon format. Can scala spark can solve? can u discuss the possible ideas? thanks