Whenever we submit a Spark application to a cluster, a driver (the Spark application master) is started. The driver creates and manages the SparkContext, through which it shares data and coordinates with the executors on the worker nodes and with the cluster manager. The cluster manager can be Spark Standalone, Hadoop YARN, or Mesos. Workers are assigned tasks, and their results are consolidated and collected back at the driver. A Spark application is executed on the cluster in one of two modes – cluster mode or client mode.
In cluster mode, the Spark driver (application master) is started on one of the worker machines. The client that submits the application can therefore disconnect right after submission, or move on to other work, while the job keeps running on the cluster. In other words, cluster mode works on a "fire and forget" basis.
The question is: when should we use cluster mode? If we submit an application from a machine that is far from the worker machines – for instance, locally from our laptop – cluster mode minimizes network latency between the driver and the executors. Likewise, if the job will run for a long time and we don't want to wait for the result, we can submit it in cluster mode; once the job is submitted, the client no longer needs to stay online.
How to submit a Spark application in cluster mode
First, go to your Spark installation directory and start a master and any number of workers on the cluster with the following commands:
./sbin/start-master.sh
./sbin/start-slave.sh spark://<<hostname/ipaddress>>:portnumber   # worker 1
./sbin/start-slave.sh spark://<<hostname/ipaddress>>:portnumber   # worker 2
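To confirm that the master is up and the workers have registered, you can fetch the standalone master's web UI, which listens on port 8080 by default (the sketch below assumes the master runs on the local machine and uses the default port; adjust the host and port for your setup):

```shell
# The standalone master serves a status page on port 8080 by default;
# it lists every worker that has registered with the master.
curl -s http://localhost:8080
```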
Then, run this command:
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://<<hostname/ipaddress>>:portnumber \
  --deploy-mode cluster \
  ./examples/jars/spark-examples_2.11-2.3.1.jar 5   # 5 = number of partitions
NOTE: Your class name, jar file, and partition count may differ.
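Since the client can disconnect after a cluster-mode submission, you may later want to check on (or stop) the driver running inside the cluster. In standalone mode, spark-submit prints a submission ID when the job is submitted, and supports --status and --kill flags that take that ID. The <driver-id> below is a placeholder, not a real ID:

```shell
# Query the state of a driver launched in cluster mode
# (supported for Spark standalone and Mesos). <driver-id> is the
# submission ID that spark-submit printed at submission time.
./bin/spark-submit --master spark://<<hostname/ipaddress>>:portnumber --status <driver-id>

# Stop that driver (and with it the application) if needed.
./bin/spark-submit --master spark://<<hostname/ipaddress>>:portnumber --kill <driver-id>
```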
In client mode, the client machine that submits the Spark application starts the driver, which maintains the SparkContext. The driver manages the tasks until the job's execution is over, so the client must stay connected to the cluster – it has to remain online until that particular job completes.
In this mode, the client continuously receives information about the status of the job and any changes happening to it, so client mode is a good choice when we want to keep monitoring a particular job. It also means the entire application depends on the local machine, since the driver resides there: if anything goes wrong on the local machine, the driver goes down, and the entire application goes down with it. Hence this mode is not suitable for production use cases. It is, however, good for debugging and testing, since the outputs appear on the driver terminal, i.e. the local machine.
How to submit a Spark application in client mode
First, go to your Spark installation directory and start a master and any number of workers on the cluster (the commands are shown above in the cluster-mode section). Then run the following command:
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://<<hostname/ipaddress>>:portnumber \
  --deploy-mode client \
  ./examples/jars/spark-examples_2.11-2.3.1.jar 5   # 5 = number of partitions
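Because the driver runs on the client in this mode, the job's output (SparkPi prints a line like "Pi is roughly ...") lands on the client terminal, which makes it easy to capture for debugging – for example by teeing it to a log file (sparkpi.log is just an illustrative name):

```shell
# In client mode the driver's stdout/stderr appear on this terminal,
# so we can watch the job live and keep a copy for later inspection.
./bin/spark-submit --class org.apache.spark.examples.SparkPi \
  --master spark://<<hostname/ipaddress>>:portnumber \
  --deploy-mode client \
  ./examples/jars/spark-examples_2.11-2.3.1.jar 5 2>&1 | tee sparkpi.log
```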
Note that the only change between the two commands is the --deploy-mode flag: client for client mode, cluster for cluster mode.
A Spark application can be submitted in two different ways – cluster mode and client mode. In cluster mode, the driver starts within the cluster on one of the worker machines, so the client can fire the job and forget it. In client mode, the driver starts on the client, so the client has to stay online and in touch with the cluster. Therefore, if the client machine is "far" from the worker nodes, it makes sense to use cluster mode; if the application is submitted from a gateway machine quite "close" to the worker nodes, client mode can be a good choice.