Application compatibility for different Spark versions

Table of contents

Reading Time: 3 minutes

Recently spark version 2.1 was released and there is a significant difference between the 2 versions.

Spark 1.6 has DataFrame and SparkContext while 2.1 has Dataset and SparkSession.

Now the question arises how to write code so that both the versions of spark are supported.

Fortunately maven provides the feature of building your application with different profiles.
In this blog i will tell you guys how to make your application compatible with different spark versions.

Lets start by creating a empty maven project.
You can use the maven-archetype-quickstart for setting up your project.
Archetypes provide a basic template for your project and maven has a rich collection of these templates for all your needs.

Once the project setup is done we need to make 3 modules. Lets name them core, spark and spark2 and setting the artifact id of each module to their respective names.

For spark modules the artifact id should be spark-.
For example spark2 module would have artifact id as spark-2.1.0.

Spark module would contain the code for spark 1.6 and spark2 would contain the code for spark 2.1.

Start by creating profiles for the 2 spark modules like this in the parent pom:-

<profiles>
  <profile>
    <id>spark-1.6</id>
    <properties>
      <spark.version>1.6.2</spark.version>
      <scala.binary.version>2.10</scala.binary.version>
      <scala.version>2.10.4</scala.version>
    </properties>
    <modules>
      <module>spark</module>
    </modules>
  </profile>
  <profile>
    <id>spark-2.1</id>
    <properties>
      <spark.version>2.1.0</spark.version>
      <scala.binary.version>2.11</scala.binary.version>
      <scala.version>2.11.8</scala.version>
    </properties>
    <modules>
      <module>spark2</module>
    </modules>
  </profile>
</profiles>

Remove both the spark entries from the tag in parent pom.

Verify the profiles by running the following maven command
1. mvn -Pspark-1.6 clean compile
2. mvn -Pspark-2.1 clean compile

You can see that only the version specific module is included in the build in the Reactor summary.

This will solve our problem of how to handle DataFrame and Dataset.

Lets start writing code by creating a class SparkUtil in both the spark modules.

Spark module (1.6.0)

public class SparkUtil {
  private DataFrame df;  
  public SparkUtil(Object df) {  
    this.df = (DataFrame)df;  
  }  
    public Row[] collect() { return df.collect();} 
}

Spark module (2.1.0)

public class SparkUtil {
  private Dataset df;
  public SparkUtil(Object df) {
    this.df = (Dataset)df;
  }
    public Row[] collect() { return (Row[]) df.collect(); }
}

We can do the same thing when creating SparkContext and SparkSession in Spark 1.6 and 2.1 respectively.

Spark module (1.6.0)

public class SessionManager {
  private HiveContext context;
  public SessionManager() { context = setHiveContext(); }
  private HiveContext setHiveContext() {
    SparkConf conf = new SparkConf().setMaster(configReader.getMasterURL())
        .setAppName(configReader.getAppName());
    JavaSparkContext javaContext = new JavaSparkContext(conf);
    context = javaContext.sc();
  }
  public DataFrame sql(String sqlText) { return context.sql(sqlText);}
}

Spark module (2.1.0)

public class SessionManager {
  private SparkSession sparkSession;
  public SessionManager() { sparkSession = setSparkSession(); }
  private SparkSession setSparkSession() {
    Builder builder = SparkSession.builder().master(configReader.getMasterURL())
        .appName(configReader.getAppName())
        .enableHiveSupport();
      return builder.getOrCreate();
  }
  public Dataset sql(String sqlText) { return sparkSession.sql(sqlText);}
}

All we have to do is call the sql method of the SessionManager class and pass the result i.e DataFrame or Dataset to the SparkUtil.

Now we can use the SessionManager class to run our queries.
To do this we have to put a dependency for our spark module in the core module.

<dependency>
  <groupId>com.knoldus.spark<groupId>
  <artifactId>spark-${spark.version}<artifactId>
  <version>1.0-SNAPSHOT<version>
<dependency>

We had earlier defined the artifact id of the spark modules with their spark version. This would help up bind the specified spark module based on the version provided by the profiles.

SessionManager sessionManager = new SessionManager();
Object result = sessionManager.sql(“Select * from userTable”);

Once we have the result we can call the collect method of the SparkUtil

SparkUtil sparkUtil = new SparkUtil(result);
 Row[] rows = sparkUtil.collect();

Now our spark application can handle both the verisons of spark efficiently.

3 thoughts on “Application compatibility for different Spark versions3 min read”

Comments are closed.

High performance systems

Data Engineering, Strategy and Analytics

Intelligence Driven Decisioning - AI/ML

Cloud Engineering

Architecture Strategy, Audit & Academy

Platforms

KDP

KDSP

Products

Premon

Studio9

Tech Hub

Akka

Scala

Rust

Spark

Functional Java

Kafka

Flink

ML/AI

DevOps

Data Warehouse

Travel

Retail

Finance

Healthcare

Media and Publishing

Consumer Internet

Hi-tech & IoT

Case Studies

Blogs

Books

Community

Resources

OS contributions

Webinars

Knolx

Check out our open positions

Services

Go to Overview

Accelerators

Go to Overview

Platforms

Products

TechHub

Industries

Go to Overview

Travel

Insights

Go to Overview

Application compatibility for different Spark versions

Share the Knol:

Related

Written by Rachel Jones

3 thoughts on “Application compatibility for different Spark versions3 min read”

COMPANY

Sign up to our newsletter

Certificates

Partners

© 2023 Knoldus, Inc. All Rights Reserved.

Part of NashTech

Privacy Policy | Sitemap

Discover more from Knoldus Blogs

Check out our
open positions