In this blog post, I will explain how to run a standalone Scala job on an Amazon EC2 Spark cluster from your local machine.
This is a simple example that processes a text file stored on Amazon S3.
Please follow the steps below to run a standalone Scala job on an Amazon EC2 Spark cluster:
1) If you have not installed Spark on your machine, follow the instructions here: Download Spark
2) After launching the Spark cluster on Amazon EC2 from your local machine, you will see the Spark master cluster URL on the console. Copy this URL.
3) Store a text file on Amazon S3 and set the environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY on your machine.
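For example, in a bash shell you could export the two variables like this (the values shown are placeholders; substitute your own credentials from the AWS console):

```shell
# Placeholder credentials -- replace with your own AWS access key and secret key.
export AWS_ACCESS_KEY_ID="YOUR_ACCESS_KEY_ID"
export AWS_SECRET_ACCESS_KEY="YOUR_SECRET_ACCESS_KEY"
```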
4) Follow the instructions to set up and launch Spark on Amazon EC2 from here: EC2 Script
5) Create a SimpleJob.scala and store it in ~/spark/examples/src/main/scala/spark/examples
6) In SimpleJob.scala, fill in the following values:
AWS_ACCESS_KEY_ID: the access key from your Amazon AWS security credentials
AWS_SECRET_ACCESS_KEY: the secret key from your Amazon AWS security credentials
Spark master cluster URL: the URL you copied in the second step
JAR file address: the path on the cluster where the JAR will be stored after running sbt package on the remote machine
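As a rough guide, SimpleJob.scala might look like the sketch below. This assumes the pre-1.0 Spark API (spark.SparkContext), which the EC2 scripts of that era used; every value in angle brackets is a placeholder you must replace with your own setting.

```scala
package spark.examples

import spark.SparkContext

// A minimal sketch, not the exact job from this post. Placeholders:
// <Spark-Master-Cluster-URL>, <path-to-examples-jar>, <your-bucket>, <your-file>.
object SimpleJob {
  def main(args: Array[String]) {
    // The Spark master URL copied in step 2, e.g. spark://ec2-...:7077
    val master = "<Spark-Master-Cluster-URL>"
    // Path of the JAR produced by sbt package on the cluster (step 9)
    val jarPath = "<path-to-examples-jar>"
    val sc = new SparkContext(master, "Simple Job", "/root/spark", List(jarPath))

    // s3n:// access relies on the AWS credentials configured in step 3.
    val lines = sc.textFile("s3n://<your-bucket>/<your-file>.txt")
    println("Total lines: " + lines.count())

    sc.stop()
  }
}
```

Running the job on a live cluster (steps 9 and 10 below) prints the line count of your S3 file.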
7) Transfer SimpleJob.scala to the remote machine:
$ rsync -v -e "ssh -i key-file" ~/spark/examples/src/main/scala/spark/examples/SimpleJob.scala root@Cluster-Host-Name:~/spark/examples/src/main/scala/spark/examples
8) Go to the Spark directory and log into the cluster using SSH:
$ ssh -i key-file root@Cluster-Host-Name
9) Now you are on the remote machine. Run cd spark, then run sbt/sbt package
10) Now run ./run spark.examples.SimpleJob. You will see the result.
This is a simple example, but if you are new to Spark, I hope it helps you get started. Please let me know your feedback; it would be highly appreciated.