Finally, after a long wait, Apache Spark 2.0 was released on Tuesday, 26 July 2016. This release builds on two years of community feedback about Spark and its APIs: the features developers loved have been kept, while the parts that caused friction have been removed or replaced.
Since Spark 2.0 is a major release of Apache Spark, it contains major changes to Spark's APIs and libraries. To understand the changes in Spark 2.0, we will look at them one by one. So, let's start with the SparkSession API.
For a long time, Spark developers were confused about when to use SQLContext and when to use HiveContext. Since HiveContext was richer in features than SQLContext, many developers favored it, but HiveContext required many extra dependencies to run, so others preferred SQLContext.
To end this confusion, the creators of Spark came up with SparkSession in Spark 2.0, a unified API for both of them. Since SparkSession is a combination of SQLContext and HiveContext, it contains all the features that were present in them.
Now, let's see how to create and use a SparkSession.
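A minimal sketch of creating a SparkSession through its builder; the application name and master URL below are example values, not requirements:

```scala
import org.apache.spark.sql.SparkSession

// Build (or reuse) a SparkSession; appName and master are example values
val spark = SparkSession
  .builder()
  .appName("SparkSessionExample")  // name shown in the Spark UI
  .master("local[*]")              // run locally, using all available cores
  .getOrCreate()
```

`getOrCreate()` returns the existing session if one is already running, so it is safe to call from multiple places in an application.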
Here, we can notice that SparkSession is similar to SparkContext, in that we provide a master URL and an application name. SparkSession also provides built-in support for Hive features like writing queries in HiveQL, accessing Hive UDFs, and reading data from Hive tables. To access Hive features, we need to enable Hive support in the SparkSession, like this:
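A sketch of enabling Hive support on the builder; the warehouse directory set here is an example path:

```scala
import org.apache.spark.sql.SparkSession

// enableHiveSupport() turns on HiveQL queries, Hive UDFs,
// and reading from Hive tables
val spark = SparkSession
  .builder()
  .appName("SparkSessionHiveExample")
  .master("local[*]")
  .config("spark.sql.warehouse.dir", "/tmp/spark-warehouse") // example path
  .enableHiveSupport()
  .getOrCreate()
```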
From the code above, we have seen how to create a SparkSession. Now, let's use it to read data in the Spark shell. In the spark-shell of Spark 2.0, a SparkSession is already created for us, available as the variable `spark`.
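For example, reading a JSON file into a DataFrame; the file path is a placeholder (the path shown ships with the Spark distribution's examples):

```scala
// In spark-shell, `spark` is the pre-created SparkSession
// people.json here is an example file with one JSON record per line
val df = spark.read.json("examples/src/main/resources/people.json")

// Print the schema inferred from the JSON data and show the rows
df.printSchema()
df.show()
```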
In the above code snippet, we have created a DataFrame from a JSON file using the SparkSession.
So, in this post we saw how to use the SparkSession API, which is the new entry point in Spark 2.0.
In future blogs we will be discussing more of the changes in Spark 2.0, so stay tuned 🙂