In this blog post, I will explain how to run a standalone Scala job on an Amazon EC2 Spark cluster from your local machine.
This is a simple example that processes a text file stored on Amazon S3.
Please follow the steps below to run a standalone Scala job on an Amazon EC2 Spark cluster:
1) If you have not installed Spark on your machine, follow the instructions here: Download Spark
2) After launching the Spark cluster on Amazon EC2 from your local machine, you will see the Spark master cluster URL on the console. Copy this URL.
3) Store a text file on Amazon S3 and set the environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY on your machine.
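For example, in a bash shell you could export the two variables like this (the values shown are placeholders; substitute your own credentials from the AWS console):

```shell
# Placeholder credentials -- replace with your own AWS access key and secret key.
export AWS_ACCESS_KEY_ID="YOUR_ACCESS_KEY_ID"
export AWS_SECRET_ACCESS_KEY="YOUR_SECRET_ACCESS_KEY"
```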
4) Follow the instructions to set up and launch Spark on Amazon EC2 from here: EC2 Script
5) Create a SimpleJob.scala and store it in ~/spark/examples/src/main/scala/spark/examples
6) In SimpleJob.scala, fill in the following values:
AWS_ACCESS_KEY_ID: the access key from your Amazon AWS security credentials
AWS_SECRET_ACCESS_KEY: the secret key from your Amazon AWS security credentials
Spark master cluster URL: the URL you copied in the second step
JAR file address: the path on the cluster where the JAR will be stored after running sbt package on the remote machine
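As a rough guide, SimpleJob.scala might look like the sketch below. This assumes the pre-1.0 Spark API (spark.SparkContext), which the EC2 scripts of that era used; every value in angle brackets is a placeholder you must replace with your own setting.

```scala
package spark.examples

import spark.SparkContext

// A minimal sketch, not the exact job from this post. Placeholders:
// <Spark-Master-Cluster-URL>, <path-to-examples-jar>, <your-bucket>, <your-file>.
object SimpleJob {
  def main(args: Array[String]) {
    // The Spark master URL copied in step 2, e.g. spark://ec2-...:7077
    val master = "<Spark-Master-Cluster-URL>"
    // Path of the JAR produced by sbt package on the cluster (step 9)
    val jarPath = "<path-to-examples-jar>"
    val sc = new SparkContext(master, "Simple Job", "/root/spark", List(jarPath))

    // s3n:// access relies on the AWS credentials configured in step 3.
    val lines = sc.textFile("s3n://<your-bucket>/<your-file>.txt")
    println("Total lines: " + lines.count())

    sc.stop()
  }
}
```

Running the job on a live cluster (steps 9 and 10 below) prints the line count of your S3 file.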
7) Transfer SimpleJob.scala to the remote machine:
$ rsync -v -e "ssh -i key-file" ~/spark/examples/src/main/scala/spark/examples/SimpleJob.scala root@Cluster-Host-Name:~/spark/examples/src/main/scala/spark/examples
8) Go to the Spark directory and log into the cluster using SSH:
$ ssh -i key-file root@Cluster-Host-Name
9) Now you are on the remote machine. Run cd spark, then run sbt/sbt package
10) Now run ./run spark.examples.SimpleJob. You will see the result.
This is a simple example, but if you are new to Spark, I hope it helps you get started. Please let me know your feedback; it would be highly appreciated.