Running a standalone Scala job on an Amazon EC2 Spark cluster


In this blog, I will explain how to run a standalone Scala job on an Amazon EC2 Spark cluster from your local machine.
It is a simple example that processes a file stored on Amazon S3.

Please follow the steps below to run a standalone Scala job on an Amazon EC2 Spark cluster:

1) If you have not installed Spark on your machine, please follow the instructions here: Download Spark

2) After launching the Spark cluster on Amazon EC2 from your local machine, you will see the Spark master cluster URL on the console. Copy this URL.

3) Store a text file on Amazon S3 and set the environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY on your machine.
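For example, on a Linux or Mac shell you can export the credentials like this (the values are placeholders for your own keys):

$ export AWS_ACCESS_KEY_ID=<your-access-key>
$ export AWS_SECRET_ACCESS_KEY=<your-secret-key>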

4) Follow the instructions to set up and launch Spark on Amazon EC2 from here: EC2 Script
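For reference, launching a cluster with the EC2 script looks roughly like the following, run from the ec2 directory of your Spark checkout; the key pair name, key file, slave count, and cluster name are placeholders you supply:

$ ./spark-ec2 -k <key-pair-name> -i <key-file> -s <number-of-slaves> launch <cluster-name>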

5) Create a SimpleJob.scala file and store it in ~/spark/examples/src/main/scala/spark/examples

6) SimpleJob.scala

package spark.examples

import spark.SparkContext
import SparkContext._

object SimpleJob {
  def main(args: Array[String]) {
    // Input file on S3; embed the AWS credentials and the bucket name in the URL
    val logFile = "s3n://<AWS_ACCESS_KEY_ID>:<AWS_SECRET_ACCESS_KEY>@<Bucket Name>/<File Name>"
    // Connect to the Spark master running on EC2 and ship the application JAR to the cluster
    val sc = new SparkContext("<Spark Master cluster URL>", "Simple Job",
      System.getenv("SPARK_HOME"), Seq("<JAR File Address>"))
    val logData = sc.textFile(logFile)
    // Count the lines containing "a" and the lines containing "b"
    val numsa = logData.filter(line => line.contains("a")).count
    val numsb = logData.filter(line => line.contains("b")).count
    println("total a : %s, total b : %s".format(numsa, numsb))
  }
}

AWS_ACCESS_KEY_ID: your access key, from the Amazon AWS security credentials page

AWS_SECRET_ACCESS_KEY: your secret access key, from the Amazon AWS security credentials page

Bucket Name / File Name: the S3 bucket and the text file you stored in step 3

Spark Master cluster URL: the URL you copied in step 2

JAR File Address: the path on the cluster where the JAR will be stored after running sbt package on the remote machine (see step 9)
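To make the placeholders concrete, here is a rough sketch of what those two lines might look like once filled in. Every value below is made up for illustration (the access key, secret key, bucket, master hostname, and JAR path are not real); substitute your own:

    val logFile = "s3n://AKIAEXAMPLEKEY:exampleSecretKey@my-bucket/input.txt"
    val sc = new SparkContext("spark://ec2-54-123-45-67.compute-1.amazonaws.com:7077", "Simple Job",
      System.getenv("SPARK_HOME"), Seq("/root/spark/examples/target/scala-2.9.2/spark-examples.jar"))

The standalone master listens on port 7077 by default, and the exact JAR path produced by sbt package depends on your Spark and Scala versions, so check it on the cluster before plugging it in.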

7) Transfer SimpleJob.scala to the remote machine:

$ rsync -v -e "ssh -i key-file" ~/spark/examples/src/main/scala/spark/examples/SimpleJob.scala root@Cluster-Host-Name:~/spark/examples/src/main/scala/spark/examples

8) Go to the Spark directory and log in to the cluster using SSH:

ssh -i key-file root@Cluster-Host-Name

9) Now you are on the remote machine. Run cd spark

Then run sbt/sbt package to build the project and produce the JAR.

10) Now run ./run spark.examples.SimpleJob. You will see the result printed on the console.

This is a simple example, but if you are new to Spark it should help you get started. Please let me know your feedback; it would be highly appreciated.


