Application compatibility for different Spark versions


Spark 2.1 was released recently, and it differs significantly from Spark 1.6.

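Spark 1.6 has DataFrame and SparkContext, while 2.1 has Dataset and SparkSession. In the Java API of Spark 2.x there is no DataFrame class at all; query results are typed as Dataset<Row>.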

So the question is how to write an application that supports both versions of Spark.

Fortunately, Maven lets you build an application with different profiles.
In this blog I will show you how to use profiles to make your application compatible with different Spark versions.

Let's start by creating an empty Maven project.
You can use maven-archetype-quickstart to set up the project.
Archetypes are project templates, and Maven has a rich collection of them.

Once the project is set up, we need three modules. Let's name them core, spark and spark2. The core module uses its own name as the artifact id.

For the Spark modules the artifact id should be spark-<spark version>, where the version matches the spark.version property of the corresponding profile.
For example, the spark2 module would have the artifact id spark-2.1.0.

The spark module contains the code for Spark 1.6, and spark2 contains the code for Spark 2.1.

Start by creating a profile for each of the two Spark modules in the parent pom:

<profiles>
  <profile>
    <id>spark-1.6</id>
    <properties>
      <spark.version>1.6.2</spark.version>
      <scala.binary.version>2.10</scala.binary.version>
      <scala.version>2.10.4</scala.version>
    </properties>
    <modules>
      <module>spark</module>
    </modules>
  </profile>
  <profile>
    <id>spark-2.1</id>
    <properties>
      <spark.version>2.1.0</spark.version>
      <scala.binary.version>2.11</scala.binary.version>
      <scala.version>2.11.8</scala.version>
    </properties>
    <modules>
      <module>spark2</module>
    </modules>
  </profile>
</profiles>

Remove both Spark module entries from the <modules> tag in the parent pom, so that each one is built only when its profile is active.

Verify the profiles by running the following Maven commands:
1. mvn -Pspark-1.6 clean compile
2. mvn -Pspark-2.1 clean compile

In the Reactor Summary you can see that only the version-specific module is included in the build.

This solves the problem of handling DataFrame and Dataset: each type is referenced only in the module built for its Spark version.

Let's start writing code by creating a class SparkUtil in both Spark modules.

Spark module (1.6.0)

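import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;

// Wraps the Spark 1.6 result type so callers never reference DataFrame directly.
public class SparkUtil {
  private final DataFrame df;

  // Takes Object so the core module stays free of version-specific types.
  public SparkUtil(Object df) {
    this.df = (DataFrame) df;
  }

  public Row[] collect() { return df.collect(); }
}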

Spark module (2.1.0)

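import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Same class name and methods as the 1.6 version, backed by Dataset<Row>.
public class SparkUtil {
  private final Dataset<Row> df;
  @SuppressWarnings("unchecked") // callers pass the Dataset<Row> returned by sql()
  public SparkUtil(Object df) {
    this.df = (Dataset<Row>) df;
  }

  // In the Java API collect() is declared to return Object, hence the cast.
  public Row[] collect() { return (Row[]) df.collect(); }
}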
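Because the constructor takes a plain Object, passing the wrong type is caught only at run time, as a ClassCastException (and a null slips through until collect is called). If you want a clearer error, you can check the type before casting. Here is a minimal sketch of that idea. FakeDataFrame and GuardedSparkUtil are made-up names, and FakeDataFrame stands in for DataFrame or Dataset so that the example runs without Spark on the classpath.

```java
// Hypothetical stand-in for the version-specific result type (DataFrame or Dataset),
// used only so this sketch runs without Spark on the classpath.
final class FakeDataFrame {
    private final String[] rows;

    FakeDataFrame(String... rows) { this.rows = rows; }

    String[] collect() { return rows.clone(); }
}

// Same shape as SparkUtil, but with an explicit type check before the cast.
final class GuardedSparkUtil {
    private final FakeDataFrame df;

    GuardedSparkUtil(Object df) {
        // instanceof is false for null, so null is rejected here as well.
        if (!(df instanceof FakeDataFrame)) {
            String actual = (df == null) ? "null" : df.getClass().getName();
            throw new IllegalArgumentException(
                "Expected " + FakeDataFrame.class.getName() + " but got " + actual);
        }
        this.df = (FakeDataFrame) df;
    }

    String[] collect() { return df.collect(); }

    public static void main(String[] args) {
        GuardedSparkUtil util = new GuardedSparkUtil(new FakeDataFrame("row1", "row2"));
        System.out.println(util.collect().length);

        try {
            new GuardedSparkUtil("not a data frame");
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

In the real modules the check would be against DataFrame in the spark module and Dataset in the spark2 module.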

We can do the same thing for the entry point: a HiveContext (created from a SparkContext) in Spark 1.6 and a SparkSession in Spark 2.1.

Spark module (1.6.0)

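import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.hive.HiveContext;

public class SessionManager {
  // configReader is the application's own configuration helper (not shown here).
  private final HiveContext context;

  public SessionManager() { context = setHiveContext(); }

  private HiveContext setHiveContext() {
    SparkConf conf = new SparkConf().setMaster(configReader.getMasterURL())
        .setAppName(configReader.getAppName());
    JavaSparkContext javaContext = new JavaSparkContext(conf);
    // HiveContext wraps the underlying SparkContext.
    return new HiveContext(javaContext.sc());
  }

  public DataFrame sql(String sqlText) { return context.sql(sqlText); }
}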

Spark module (2.1.0)

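import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SessionManager {
  // configReader is the application's own configuration helper (not shown here).
  private final SparkSession sparkSession;

  public SessionManager() { sparkSession = setSparkSession(); }

  private SparkSession setSparkSession() {
    SparkSession.Builder builder = SparkSession.builder().master(configReader.getMasterURL())
        .appName(configReader.getAppName())
        .enableHiveSupport();
    return builder.getOrCreate();
  }

  public Dataset<Row> sql(String sqlText) { return sparkSession.sql(sqlText); }
}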

All we have to do is call the sql method of the SessionManager class and pass the result, i.e. the DataFrame or Dataset, to SparkUtil.

Now we can use the SessionManager class to run our queries.
To do this, the core module needs a dependency on the Spark module:

<dependency>
  <groupId>com.knoldus.spark</groupId>
  <artifactId>spark-${spark.version}</artifactId>
  <version>1.0-SNAPSHOT</version>
</dependency>

Earlier we gave each Spark module an artifact id that contains its Spark version. This lets Maven bind the right Spark module based on the spark.version set by the active profile.
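To make the binding concrete, here is a small sketch in plain Java (no Maven or Spark needed) of the substitution that happens to spark-${spark.version}. The class ArtifactIdDemo and its helper methods are made up for illustration; Maven does this interpolation internally. The property values are the ones from the profiles in the parent pom.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustration only: mimics how ${spark.version} in the core module's dependency
// is replaced once a profile has set that property.
final class ArtifactIdDemo {

    // Replaces every ${key} placeholder in the template with its property value.
    static String resolve(String template, Map<String, String> properties) {
        String resolved = template;
        for (Map.Entry<String, String> property : properties.entrySet()) {
            resolved = resolved.replace("${" + property.getKey() + "}", property.getValue());
        }
        return resolved;
    }

    // The spark.version property contributed by each profile in the parent pom.
    static Map<String, String> profileProperties(String profileId) {
        Map<String, String> properties = new LinkedHashMap<>();
        if ("spark-1.6".equals(profileId)) {
            properties.put("spark.version", "1.6.2");
        } else if ("spark-2.1".equals(profileId)) {
            properties.put("spark.version", "2.1.0");
        } else {
            throw new IllegalArgumentException("Unknown profile: " + profileId);
        }
        return properties;
    }

    public static void main(String[] args) {
        String template = "spark-${spark.version}";
        System.out.println(resolve(template, profileProperties("spark-1.6"))); // spark-1.6.2
        System.out.println(resolve(template, profileProperties("spark-2.1"))); // spark-2.1.0
    }
}
```

Note that with the profiles shown earlier, the spark module's artifact id has to be spark-1.6.2, since that is the value of spark.version in the spark-1.6 profile.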

SessionManager sessionManager = new SessionManager();
Object result = sessionManager.sql("Select * from userTable");

Once we have the result, we can call the collect method of SparkUtil:

SparkUtil sparkUtil = new SparkUtil(result);
Row[] rows = sparkUtil.collect();

Now our Spark application can be built against either version of Spark just by switching the Maven profile.

