If we want to run a Spark cluster on a single standalone machine, we need to set up some configuration.
We will be using the launch scripts provided by Spark, but first there are a couple of configuration settings we need to make.
First, set up the Spark environment: open the following file, or create it from the template file spark-env.sh.template if it does not already exist (see the sketch below):
conf/spark-env.sh
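Assuming you are working from the Spark installation directory, one simple way to create the file from the bundled template is:

cp conf/spark-env.sh.template conf/spark-env.sh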
Then add configuration for the workers, for example:
export SPARK_WORKER_MEMORY=1g
export SPARK_EXECUTOR_MEMORY=512m
export SPARK_WORKER_INSTANCES=2
export SPARK_WORKER_CORES=2
export SPARK_WORKER_DIR=/home/knoldus/work/sparkdata
Here SPARK_WORKER_MEMORY specifies the amount of memory a worker node may allocate to executors; if this value is not set, the default is the machine's total memory minus 1 GB. Since we are running everything on our local machine, we don't want the workers to use up all of our memory.
SPARK_EXECUTOR_MEMORY sets the default memory allocated to each executor (512 MB here).
SPARK_WORKER_INSTANCES specifies the number of worker instances to run on this machine; it is set to 2 because we want two slave nodes.
SPARK_WORKER_CORES specifies the number of cores each worker is allowed to use.
SPARK_WORKER_DIR is the directory in which applications are run, and it holds both logs and scratch space.
With the above configuration we get a cluster of 2 workers, each with 1 GB of memory and a maximum of 2 cores, i.e. up to 4 cores and 2 GB in total.
After setting up the environment, add the hostnames (or IP addresses) of the slaves to the following conf file:
conf/slaves
When using the launch scripts, this file identifies the hosts on which the slave nodes will run. Since we have a single standalone machine, we simply set localhost in this file.
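For this single-machine setup, conf/slaves contains just one entry:

localhost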
Now start the master with the following command:
sbin/start-master.sh
The master runs at spark://<system_name>:7077, for example spark://knoldus-dell:7077, and you can monitor it from the web UI at localhost:8080.
Now start the workers for this master with the following command:
sbin/start-slaves.sh
Now your standalone cluster is ready. To use it with the Spark shell, open the shell with the following flag:
spark-shell --master spark://knoldus-Vostro-3560:7077
You can also pass additional Spark configuration, such as driver memory, number of cores, etc.
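For example, something like the following (the flag values are only illustrative; adjust them to your machine):

spark-shell --master spark://knoldus-Vostro-3560:7077 --driver-memory 1g --total-executor-cores 2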
Now run the following commands in the Spark shell:
val file = sc.textFile("READ.md")
file.count()
file.take(3)
Now you can see which worker ran each task and which workers have completed their tasks in the master UI (localhost:8080).
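As a further illustration (not part of the original steps), here is a minimal word-count sketch you could run in the same shell session to see work spread across both workers; it reuses the file RDD created above:

val words = file.flatMap(line => line.split(" "))
val counts = words.map(word => (word, 1)).reduceByKey(_ + _)
counts.take(5)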