In the last post of our Play with Spark! series, we saw how to integrate Spark Streaming into a Play Scala application. In this blog we will see how to add the Spark SQL feature to a Play Scala application.
Spark SQL is a powerful tool of Apache Spark. It allows relational queries, expressed in SQL, HiveQL, or Scala, to be executed using Spark. Apache Spark has a new type of RDD to support queries expressed in SQL format: the SchemaRDD. A SchemaRDD is similar to a table in a traditional relational database.
To add the Spark SQL feature to a Play Scala application, follow these steps:
1). Add the following dependencies to the build.sbt file
The dependency "org.apache.spark" %% "spark-sql" % "1.0.0" is specific to Spark SQL.
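The relevant part of build.sbt would look roughly like this (declaring spark-core explicitly alongside spark-sql is an assumption; spark-sql pulls it in transitively, and the versions here match the era of this post):

```scala
// build.sbt (fragment) — Spark dependencies for a Play Scala application
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.0.0",
  "org.apache.spark" %% "spark-sql"  % "1.0.0"
)
```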
2). Create a file app/utils/SparkSQL.scala and add the following code to it
Like any other Spark component, Spark SQL runs on its own context, the SQLContext, which in turn runs on top of the SparkContext. So, we first build a sqlContext so that we can use Spark SQL.
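A minimal sketch of such a file, assuming a local master and an app name of our own choosing (both are assumptions, not from the original post):

```scala
// app/utils/SparkSQL.scala (hypothetical sketch)
package utils

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object SparkSQL {
  // Assumed configuration: run locally with 2 threads
  val sparkConf = new SparkConf()
    .setMaster("local[2]")
    .setAppName("PlaySparkSQL")

  val sc = new SparkContext(sparkConf)

  // The SQLContext runs on top of the SparkContext
  val sqlContext = new SQLContext(sc)
}
```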
3). In the above code, notice that we have defined a case class, WordCount.
This case class defines the schema of the table in which we are going to store data in SQL format.
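The case class might look like the following (the field names are assumptions; in Spark SQL, each field of the case class becomes a column of the table, with types inferred via reflection):

```scala
// Hypothetical schema: one row per distinct word, with its occurrence count.
// Field names become column names ("word", "count") in the registered table.
case class WordCount(word: String, count: Int)

// Plain case-class usage, independent of Spark:
val wc = WordCount("spark", 12)
println(wc.word + " occurs " + wc.count + " times")
```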
4). Next, we map the variable wordCount to the case class WordCount.
Here we convert wordCount from an RDD to a SchemaRDD. Then we register it as a table so that we can run SQL queries against it.
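The conversion and registration could be sketched as follows (the file path and the word-counting pipeline are assumptions; the implicit import of createSchemaRDD is what turns an RDD of case-class instances into a SchemaRDD in Spark 1.0):

```scala
import utils.SparkSQL.{sc, sqlContext}
import sqlContext.createSchemaRDD // implicit RDD[WordCount] => SchemaRDD

// Hypothetical pipeline: count word occurrences in a text file,
// then map each (word, count) pair into the WordCount schema.
val wordCount = sc.textFile("public/data/README.md")
  .flatMap(line => line.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
  .map { case (word, count) => WordCount(word, count) }

// Register the SchemaRDD as a table named "wordCount"
wordCount.registerAsTable("wordCount")
```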
5). Finally, we construct a SQL query in Scala
Here we fetch the words that occur more than 10 times in our text file. We have used Spark SQL's Language-Integrated Relational Queries, which are available only in Scala. To know about the other types of SQL queries supported by Spark SQL, click here.
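Such a query could be written in the Spark 1.0 SchemaRDD DSL, where Scala symbols refer to columns, with the plain-SQL string form as an alternative (both sketches assume the wordCount SchemaRDD and table registered above):

```scala
// Language-Integrated query: words occurring more than 10 times
val frequentWords = wordCount.where('count > 10).select('word, 'count)

// Equivalent plain-SQL form against the registered table
val viaSql = sqlContext.sql("SELECT word, count FROM wordCount WHERE count > 10")

frequentWords.collect().foreach(println)
```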
To download a demo application, click here.