Tuning Spark on YARN


In this blog we will learn how to tune Spark on YARN in both modes, yarn-client and yarn-cluster. The only requirement to get started is a Hadoop-based cluster running Spark on YARN. In case you want to create a cluster, you can follow this blog here.

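1. yarn-client mode: In client mode, the driver runs in the client process, and the application master is only used for requesting resources from YARN. To manage memory, first make sure that Spark can find your yarn-site.xml (for example, by pointing HADOOP_CONF_DIR or YARN_CONF_DIR at the directory that contains it), then tune the following properties: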

  • spark.yarn.am.memory: Amount of memory to use for the YARN Application Master in client mode. To increase it, set spark.yarn.am.memory in spark-defaults.conf, or pass it to spark-submit with the --conf parameter. Make sure that you do not allocate more memory than the capacity of the node manager, which is defined in yarn-site.xml as yarn.nodemanager.resource.memory-mb.

For example: $SPARK_HOME/bin/spark-submit --conf spark.yarn.am.memory=1024m

The default value of this property is 512m.

  • spark.yarn.am.cores: Number of cores to use for the YARN Application Master in client mode. The default value is 1, and you can set it either in spark-defaults.conf or with the --conf parameter.
  • yarn.scheduler.capacity.maximum-am-resource-percent: Maximum percentage of cluster resources that can be used to run application masters, which in effect controls the number of applications that can run concurrently. The value is written as a fraction and defaults to 0.1 (10 percent); raising it, up to 0.9 (90 percent), lets more applications run at the same time. You can find this property in $HADOOP_HOME/etc/hadoop/capacity-scheduler.xml.
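The client-mode settings above can be put together in a small shell sketch. It only prints the spark-submit command instead of running it, and the node manager capacity (8192 MB), the class name com.example.MyApp and the jar my-app.jar are hypothetical placeholders, not values from this post. The fit check is deliberately simplified: it compares only the requested memory and ignores the memory overhead that YARN adds to the container.

```shell
#!/bin/sh
# Assumed values for illustration; read the real limit from yarn-site.xml.
NODEMANAGER_MB=8192   # yarn.nodemanager.resource.memory-mb (assumed)
AM_MEMORY_MB=1024     # value to use for spark.yarn.am.memory
AM_CORES=1            # value to use for spark.yarn.am.cores

# Succeeds only if the requested AM memory ($1) does not exceed
# the node manager capacity ($2). Overhead is ignored in this sketch.
am_fits() {
  [ "$1" -le "$2" ]
}

# Prints (does not run) a client-mode spark-submit command line.
# com.example.MyApp and my-app.jar are hypothetical placeholders.
client_submit_cmd() {
  echo "\$SPARK_HOME/bin/spark-submit --master yarn --deploy-mode client" \
       "--conf spark.yarn.am.memory=${1}m --conf spark.yarn.am.cores=${2}" \
       "--class com.example.MyApp my-app.jar"
}

if am_fits "$AM_MEMORY_MB" "$NODEMANAGER_MB"; then
  client_submit_cmd "$AM_MEMORY_MB" "$AM_CORES"
else
  echo "spark.yarn.am.memory exceeds yarn.nodemanager.resource.memory-mb" >&2
fi
```

The same two properties can instead be placed in spark-defaults.conf, which avoids repeating them on every submission.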

2. yarn-cluster mode: In cluster mode, the Spark driver runs inside an application master process which is managed by YARN on the cluster, and the client can go away after initiating the application.

  • spark.driver.memory: Amount of memory to use for the driver process, i.e. where SparkContext is initialized. The default value is 1g.
  • spark.driver.cores: Number of cores to use for the driver in cluster mode. The default value is 1, and you can set it either in spark-defaults.conf or with the --conf parameter.
  • yarn.scheduler.capacity.maximum-am-resource-percent: Maximum percentage of cluster resources that can be used to run application masters, which in effect controls the number of applications that can run concurrently. The value is written as a fraction and defaults to 0.1 (10 percent); raising it, up to 0.9 (90 percent), lets more applications run at the same time. You can find this property in $HADOOP_HOME/etc/hadoop/capacity-scheduler.xml.
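To see how these cluster-mode properties interact, here is a rough shell sketch that estimates how many application masters (and therefore drivers) could run at once. The cluster size (100 GB), the 2g driver and the 10 percent AM share are assumed numbers for illustration. The container size uses Spark's default driver memory overhead (10 percent of the driver memory, at least 384 MB). The estimate ignores YARN's rounding to yarn.scheduler.minimum-allocation-mb and any per-queue limits, so treat the result as an upper bound rather than an exact figure.

```shell
#!/bin/sh
# Assumed values for illustration only.
CLUSTER_MB=102400   # total memory managed by YARN (assumed: 100 GB)
AM_PERCENT=10       # maximum-am-resource-percent of 0.1, as an integer percent
DRIVER_MB=2048      # spark.driver.memory=2g

# Driver container size in MB: driver memory ($1) plus the default
# overhead of 10% of that memory, with a minimum of 384 MB.
driver_container_mb() {
  overhead=$(( $1 / 10 ))
  if [ "$overhead" -lt 384 ]; then
    overhead=384
  fi
  echo $(( $1 + overhead ))
}

# Rough upper bound on concurrently running application masters:
# (cluster memory $1 * AM percent $2 / 100) / container size $3.
max_concurrent_ams() {
  echo $(( $1 * $2 / 100 / $3 ))
}

container=$(driver_container_mb "$DRIVER_MB")
echo "driver container: ${container} MB"
echo "max concurrent AMs: $(max_concurrent_ams "$CLUSTER_MB" "$AM_PERCENT" "$container")"
```

With these assumed numbers, each driver container needs 2432 MB, so a 10 percent share of a 100 GB cluster leaves room for about 4 drivers; a larger share or a smaller driver raises that count.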
