Running Spark on DC/OS

DevOps engineers have long needed an open source tool that makes it easy to deploy code that has come this far through all its ups and downs and is still evolving (pun intended). As we all know, in the agile world requirements shift every short while. Whether it is adding a feature or tweaking an old one to improve the user experience, our applications always evolve with time.

Hence, to deploy our applications that frequently, we needed to shorten deployment time and make the process easy and feasible for DevOps people. DC/OS is one such tool: it provides an easy way to deploy applications without much effort. It is a system that runs over a cluster of machines. There are several blogs available if you want to know more about DC/OS: the need for it, its architecture, and deploying services on it. But today we are here to learn how to deploy our Spark application on DC/OS.

Running Spark

I am assuming we already know how to deploy services on DC/OS. These can be services provided by DC/OS itself (e.g. Cassandra, HDFS) or our own custom services (or applications), which we run using a Marathon JSON definition. In this blog, we’ll see how to deploy our Spark application on DC/OS.

Install Spark on DC/OS

First, we need to install the Spark service on DC/OS, through which we can run our application. There are two ways to do that:

  1. Using the command line: we can install the Spark service with the following command on the DC/OS CLI.
    dcos package install spark
  2. Using the GUI: we can install Spark from the Catalog section of DC/OS and then set up just the command line with the following command.
    dcos package install spark --cli

Now after installing you can see Spark on your DC/OS services page.

[Screenshot: Spark listed on the DC/OS services page]

And you have a new command available for dcos which is

dcos spark

If you type dcos spark on your DC/OS command line, you’ll see something like this:

[Screenshot: output of the dcos spark command]

These are all the options you can fill in to run your Spark application or view the status of any running Spark application.

dcos spark provides many options, for example --verbose to enable extra logging for Spark, or a custom DC/OS URL (using --custom-dcos-url=value) and a custom DC/OS auth token (using --custom-auth-token=value). But today we’ll be talking simply about running a Spark job on a DC/OS cluster. We’ll go into depth later on.

dcos spark run

We’ll be using the dcos spark run command to run our Spark job on DC/OS. It takes arguments for submitting the job, exactly like the spark-submit command does. The syntax is:

dcos spark run --submit-args=ARGS

The value ARGS here is a set of flags that let us provide necessary information, such as executor memory and cores, when creating a Spark cluster. The following are some of the flags you may need for a basic start with Spark on DC/OS:

  • --name: the name assigned to your Spark job, which can be seen in the Mesos UI of DC/OS.
  • --driver-cores: the number of cores to be assigned to the driver (default is “1”)
  • --driver-memory: memory to be assigned to the driver (default is “1G”)
  • --executor-memory: memory to be assigned to each executor (default is “1G”)
  • --total-executor-cores: total cores that will be assigned to the cluster (to be distributed between executors). If you don’t set this, the resulting Spark cluster will occupy all the resources available in your DC/OS cluster.
  • --conf=PROP=VALUE: use this flag to assign any custom Spark property to your Spark job. PROP is the property name and VALUE is the value you want to assign to it.
    For instance, if you want to enable backpressure in your streaming job, you can add --conf=spark.streaming.backpressure.enabled=true to the command.
  • --class: takes the fully qualified name of the main class (the entry point of the application), followed by the URL of the jar from which Mesos can fetch it (a downloadable link, possibly on HDFS).
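The flags above can be collected into one string before handing them to dcos spark run. Here is a minimal sketch, assuming a shell variable called SUBMIT_ARGS and a job name of example_streaming_job (both made up for illustration) together with placeholder values for the class and the jar URL. It only prints the command so that you can review it before running anything:

```shell
#!/bin/sh
# Sketch only: variable names and values are illustrative placeholders.
JOB_NAME="example_streaming_job"

# Inside double quotes a trailing backslash joins the lines,
# so SUBMIT_ARGS ends up as one space-separated string.
SUBMIT_ARGS="--name=${JOB_NAME} \
--driver-cores=1 --driver-memory=1G \
--executor-memory=1G --total-executor-cores=4 \
--conf=spark.streaming.backpressure.enabled=true \
--class=path.to.main.Class http://downloadable.link-to.jar"

# Print the command instead of running it.
echo "dcos spark run --submit-args=\"${SUBMIT_ARGS}\""
```

Keeping the arguments in a variable makes it easier to spot a missing flag than editing one long command line in place.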

Also, to run Spark we need the URL of the master. DC/OS is based on the Apache Mesos distributed systems kernel, so for the master URL we provide the Mesos master URL, which looks like “mesos://zk://leader.mesos:2181/mesos”.

So far we haven’t seen any flag for providing the master URL in the submit arguments of the run command. So we’ll use the --conf flag to set the master by adding the flag

--conf=spark.master=mesos://your-mesos-master-and-port/mesos

to our dcos spark run command.

These were some basic arguments for the run command to submit a Spark job. Other than --submit-args, DC/OS also provides the following flags, which help in running the application.

  1. --docker-image: With Mesos Docker support we can also use the Mesos Docker containerizer. For this, we can use the --docker-image flag to provide the Docker image in which the Spark executors will run.
    (Or we can set the Spark property through --conf in the submit arguments, using --conf=spark.mesos.executor.docker.image=DOCKER_SPARK)
  2. --env: Any environment variable that we want set in our application environment can be provided using this flag.
    (--env=TABLE_NAME=my-database)

Example

It’s time to understand this with an example.

Let’s suppose we need to build a cluster of 8 executors, each with 20GB of RAM and 4 cores. The calculation gives us 4 cores * 8 executors = 32 cores in total for the cluster. Now we either set the number of executors (using spark.executor.instances) together with the total cores for the cluster, or we set the number of cores per executor (using spark.executor.cores) together with the total cores, so that the cores are distributed among the executors (and the required number of executors is created).

Let’s call this job “my_first_dcos_spark_job”. In this scenario, we can use the --total-executor-cores flag of --submit-args with a value of 32 to set the total cores in the cluster. Now we need to cap the cores one executor can have at 4, so that we end up with 8 executors in our cluster (as per the math above). We can do this with the --conf flag by setting the Spark property spark.executor.cores. Hence we have our submit-args as:
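The sizing arithmetic is easy to check in the shell. This is just a sketch with illustrative variable names (they are not part of the dcos CLI), using 8 executors with 4 cores and 20GB each:

```shell
#!/bin/sh
# Illustrative variable names; not part of the dcos CLI.
EXECUTORS=8
CORES_PER_EXECUTOR=4
EXECUTOR_MEMORY_GB=20

# Total cores to request for the whole cluster.
TOTAL_CORES=$((EXECUTORS * CORES_PER_EXECUTOR))
# Memory used by all executors together (the driver is not counted).
TOTAL_EXECUTOR_MEMORY_GB=$((EXECUTORS * EXECUTOR_MEMORY_GB))

echo "--total-executor-cores=${TOTAL_CORES} --conf=spark.executor.cores=${CORES_PER_EXECUTOR}"
echo "memory across all executors: ${TOTAL_EXECUTOR_MEMORY_GB}G"
```

So the cluster needs 32 cores and 160G of memory for the executors alone, which is worth comparing with what your DC/OS cluster actually has free.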

--submit-args="\
--name=my_first_dcos_spark_job \
--conf=spark.master=mesos://your-mesos-master-and-port/mesos \
--total-executor-cores=32 --conf=spark.executor.cores=4 \
--executor-memory=20G \
--class=path.to.main.Class http://downloadable.link-to.jar \
"

In this case, we are keeping the default driver configuration, so 1 core and 1G of RAM will be assigned to the driver.

OK, now we only need to add the Docker image for Spark and some environment variables, which can be added using:

--docker-image=path.to.docker.image \
--env=KAFKA_BROKER=some.ip.com:9092 \
--env=KAFKA_TOPIC=some-topic \
--env=KAFKA_GROUP_ID=group-007

The final step is to combine both these parts with the dcos spark run command. So the final requirements for running our Spark application are:

  1. Name of the job will be “my_first_dcos_spark_job”
  2. Master for Spark is mesos://your-mesos-master-and-port/mesos (Mesos master URL)
  3. Eight executors, each having 4 cores and 20 GB of RAM
  4. Some environment variables.

And the command that we should be running for this scenario will be:

dcos spark run \
--name="spark2" \
--submit-args="\
--name=my_first_dcos_spark_job \
--conf=spark.master=mesos://your-mesos-master-and-port/mesos \
--total-executor-cores=32 --conf=spark.executor.cores=4 \
--executor-memory=20G \
--class=path.to.main.Class http://downloadable.link-to.jar \
" \
--docker-image=path.to.docker.image \
--env=KAFKA_BROKER=some.ip.com:9092 \
--env=KAFKA_TOPIC=some-topic \
--env=KAFKA_GROUP_ID=group-007
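If you submit jobs often, it can help to assemble this command from variables and look at it before it runs. The following is only a sketch under a few assumptions: the variable names and the DRY_RUN switch are invented for this example, and the values are the placeholders from the command above. By default it prints the command and runs nothing:

```shell
#!/bin/sh
# Placeholder values from the example; variable names are illustrative.
JOB_NAME="my_first_dcos_spark_job"
MESOS_MASTER="mesos://your-mesos-master-and-port/mesos"
DOCKER_IMAGE="path.to.docker.image"

SUBMIT_ARGS="--name=${JOB_NAME} \
--conf=spark.master=${MESOS_MASTER} \
--total-executor-cores=32 --conf=spark.executor.cores=4 \
--executor-memory=20G \
--class=path.to.main.Class http://downloadable.link-to.jar"

# Turn KEY=VALUE pairs into repeated --env flags.
ENV_FLAGS=""
for pair in \
  "KAFKA_BROKER=some.ip.com:9092" \
  "KAFKA_TOPIC=some-topic" \
  "KAFKA_GROUP_ID=group-007"
do
  ENV_FLAGS="${ENV_FLAGS} --env=${pair}"
done

# Dry run by default: print the command. Set DRY_RUN=0 to execute it.
if [ "${DRY_RUN:-1}" = "0" ] && command -v dcos >/dev/null 2>&1; then
  # ENV_FLAGS is left unquoted on purpose so it splits into separate flags.
  dcos spark run --name="spark2" --submit-args="${SUBMIT_ARGS}" \
    --docker-image="${DOCKER_IMAGE}" ${ENV_FLAGS}
else
  echo "dcos spark run --name=spark2 --submit-args=\"${SUBMIT_ARGS}\" --docker-image=${DOCKER_IMAGE}${ENV_FLAGS}"
fi
```

Leaving ENV_FLAGS unquoted only works because none of the values contain spaces; for values with spaces you would need a different approach.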

After running this command we’ll get a submission id:

[Screenshot: submission id returned by dcos spark run]

Now after successfully submitting the job you can see your job running with the provided name (my_first_dcos_spark_job in this case) in the Mesos UI of DC/OS.

[Screenshot: Mesos UI showing the running Spark job]

In this image, the job appears under the name we provided (partially masked in the screenshot), and the Spark UI is reachable at 10.174.3.xxx:4040 (the IP under the “Host” column).

The submission id can be used to check the status of the job using the following command:

dcos spark status <submission-id>

We can also stop the job using this id:

dcos spark kill <submission-id>
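Both commands take the submission id as their argument. Here is a small sketch, assuming the id is kept in a variable named SUBMISSION_ID; the default value below is a made-up placeholder, not a real submission. It checks the status only if the dcos CLI is installed and otherwise just prints what it would run:

```shell
#!/bin/sh
# Assumption: SUBMISSION_ID holds the id printed by dcos spark run.
# The default is a made-up placeholder, not a real submission.
SUBMISSION_ID="${SUBMISSION_ID:-driver-00000000000000-0000}"

STATUS_CMD="dcos spark status ${SUBMISSION_ID}"
KILL_CMD="dcos spark kill ${SUBMISSION_ID}"

if command -v dcos >/dev/null 2>&1; then
  # Runs the status check against the cluster the CLI is attached to.
  ${STATUS_CMD}
else
  echo "dcos CLI not found; would run: ${STATUS_CMD}"
fi

# The kill command is only printed, never executed by this sketch.
echo "to stop the job: ${KILL_CMD}"
```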

Hope this helped 🙂

For any queries, ask in the comments.




  • https://spark.apache.org/docs/latest/running-on-mesos.html
  • https://spark.apache.org/docs/latest/configuration.html
  • https://docs.mesosphere.com/services/spark/2.1.0-2.2.0-1/quickstart/

 

Written by 

Anuj Saxena is a software consultant having more than 1.5 years of experience. Anuj has worked on functional programming languages like Scala and functional Java and is also familiar with other programming languages such as Java, C, C++, HTML. He is currently working on reactive technologies like Spark, Kafka, Akka, Lagom, Cassandra and also used DevOps tools like DC/OS and Mesos for Deployments. His hobbies include watching movies, anime and he also loves travelling a lot.
