How to Schedule Pods on Nodes in Kubernetes

Reading Time: 5 minutes

Kubernetes is an orchestrator: its job is to manage containerized workloads across the environment it manages. One of its primary tasks is scheduling pods onto the best available node, which is taken care of by one of the Control Plane’s components – the Scheduler.

But what if we want to customize how pods are scheduled to satisfy some of our business use-cases? For example, scheduling a specific service only in a specific zone, telling certain pods to always co-locate on the same node, or running pods only on GPU-enabled nodes.

In such scenarios, you have to tell the scheduler about your constraints, and it will schedule the pods accordingly. For this, Kubernetes provides four ways of customizing pod scheduling:

  • NodeSelector
  • Affinity
  • Taints & Tolerations
  • Node Name

Let’s understand each of them in detail.

Node Selector

Node Selector is the simplest recommended form of node selection constraint. We specify a map of key-value pairs in the podSpec under the nodeSelector field, which makes the pod eligible to be scheduled only on nodes that carry all of those labels. If no node has that set of labels, the pod will remain in the Pending state. Here’s an example:

## At node level
kubectl label nodes <node-name> nodetype=ssd-enabled

## At pod level
spec:
  containers:
  - name: nginx
    image: nginx
    imagePullPolicy: IfNotPresent
  nodeSelector:
    nodetype: ssd-enabled

In the above example, the pod will only be scheduled on nodes that have the label nodetype=ssd-enabled.
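
If you want to double-check the setup, the following commands list the nodes that carry the label and show which node the pod eventually landed on (the pod name is a placeholder):

## Check which nodes carry the label
kubectl get nodes -l nodetype=ssd-enabled

## Check which node the pod was scheduled onto (see the NODE column)
kubectl get pod <pod-name> -o wide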

Affinity

Node Selector only allows you to constrain pods to nodes with specific labels, but affinity and anti-affinity expand this feature to a great extent. There are three types of affinities available:

  • Node Affinity
  • Pod Affinity
  • Pod Anti-Affinity

Node Affinity

This is similar to NodeSelector, but it gives you more control over pod scheduling. Here, you can define whether you want “hard affinity” or “soft affinity”.

Hard affinity means that the pod will be scheduled only if it finds a node with matching labels, similar to Node Selector. But here, you can provide multiple options and the pod will be scheduled on any node that matches one of them. This is specified under requiredDuringSchedulingIgnoredDuringExecution as follows:

## Pod Spec for Hard NodeAffinity
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: nodetype
            operator: In
            values:
            - ec2-az1
            - ec2-az2

Let’s understand the above example. It means that the pod will be scheduled on a node that has the label key nodetype with a value of either ec2-az1 or ec2-az2. If the cluster doesn’t have a node with either label, the pod will remain in the Pending state.
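
Of course, the nodes need to carry those labels in the first place. A quick way to add them (the node names here are placeholders):

## Label the candidate nodes (node names are illustrative)
kubectl label nodes <node-in-az1> nodetype=ec2-az1
kubectl label nodes <node-in-az2> nodetype=ec2-az2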

Soft affinity means that the scheduler will prefer nodes with the matching labels, but will still schedule the pod on a different node if no matching node is found. This is configured under preferredDuringSchedulingIgnoredDuringExecution as follows:

## Pod Spec for Soft NodeAffinity
spec:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        preference:
          matchExpressions:
          - key: nodetype
            operator: In
            values:
            - ec2-az3

Let’s understand the above example. It means that the scheduler will prefer a node that has the label nodetype=ec2-az3. If the cluster doesn’t have any node with that label, the pod will be scheduled on some other node as decided by the scheduler.
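
The two forms can also be combined in a single spec: the required block filters out ineligible nodes, and the preferred block ranks the nodes that remain. Here’s a minimal sketch reusing the labels from the examples above:

## Pod Spec combining hard and soft NodeAffinity
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: nodetype
            operator: In
            values:
            - ec2-az1
            - ec2-az2
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        preference:
          matchExpressions:
          - key: nodetype
            operator: In
            values:
            - ec2-az1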

Pod Affinity

Pod affinity means that the pod will be scheduled on a node where some other pod with a specific label is already running. Here, we specify the affinity on the basis of pod labels rather than node labels. Here’s an example:

## Pod Spec with pod affinity
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: security
            operator: In
            values:
            - S1
        topologyKey: kubernetes.io/hostname

In the above example, the pod will be scheduled onto a node where another pod with the label security=S1 is already running. If no node has such a pod running, this pod will also go into the Pending state. Pod affinity also supports hard and soft affinity in the same way as node affinity, as sketched below.
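
For reference, here’s what the soft (preferred) variant of the same pod affinity could look like; the weight value is illustrative:

## Pod Spec with soft pod affinity (preferred, not required)
spec:
  affinity:
    podAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: security
              operator: In
              values:
              - S1
          topologyKey: kubernetes.io/hostname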

Pod Anti Affinity

Pod anti-affinity means that the pod will avoid being co-located with pods that have the specified labels. Here’s an example:

## Pod Spec with Pod anti affinity
spec:
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: security
              operator: In
              values:
              - S2
          topologyKey: kubernetes.io/hostname

In the above example, the pod will be scheduled onto nodes where no pod with the label security=S2 is running. And since it’s a soft anti-affinity, if every node already has a pod with that label, the scheduler will still place the pod on one of them.
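
A common real-world use of pod anti-affinity is spreading the replicas of a Deployment across nodes so that no two replicas share a host. Here’s a minimal sketch, assuming the pods in the Deployment’s template carry the label app=web:

## Spread Deployment replicas across nodes (labels are illustrative)
spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
            - web
        topologyKey: kubernetes.io/hostname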

Taints & Tolerations

Where NodeSelector and Affinity attract pods to a certain set of nodes, taints are the opposite: they repel a set of pods from nodes. Taints are applied to nodes, whereas tolerations are applied to pods. Taints and tolerations work together to ensure that pods are not scheduled onto inappropriate nodes.

Here’s an example for the same:

## Taint the node
kubectl taint nodes node1 my-key=my-value:NoSchedule

In the above example, we are tainting node1 with the key-value pair my-key=my-value and the taint effect NoSchedule. Now only pods that have a matching toleration for this taint will be able to be scheduled onto node1.
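
To inspect or undo the taint later, you can use the following commands (a trailing minus removes the taint):

## View the taints on the node
kubectl describe node node1 | grep Taints

## Remove the taint (note the trailing "-")
kubectl taint nodes node1 my-key=my-value:NoSchedule-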

Now, let’s understand what taint effects are. A taint effect specifies what happens to a pod that cannot tolerate the taint. There are three taint effects, which are as follows:

  • NoSchedule
    • This means that no pod will be scheduled onto the tainted node unless it has a matching toleration.
  • PreferNoSchedule
    • This is a “soft” version of NoSchedule. The scheduler will try not to place pods that lack a matching toleration onto the tainted node, but it is not guaranteed.
  • NoExecute
    • If a pod is already running on a node tainted with NoExecute and does not have a matching toleration, it will be evicted from that node and scheduled onto a different one.

Here’s an example of how to provide a toleration in the podSpec:

## Adding Toleration to PodSpec
spec:
  tolerations:
  - key: "my-key"
    operator: "my-value"
    effect: "NoSchedule"

In the above example, the pod tolerates nodes that carry the taint my-key=my-value with the effect NoSchedule. If a pod cannot tolerate a node’s taint, the scheduler will not place it onto that node. Note that a toleration only allows scheduling onto the tainted node; it does not guarantee it.
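
If you want the pod to tolerate any value of a given key rather than one exact value, you can use the Exists operator instead of Equal, in which case the value field is omitted:

## Tolerate any value of "my-key" with the NoSchedule effect
spec:
  tolerations:
  - key: "my-key"
    operator: "Exists"
    effect: "NoSchedule"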

NodeName

NodeName is the most direct form of node selection constraint. If we provide this field in the podSpec, the scheduler is bypassed entirely and the pod is placed directly onto the node whose name is provided in this field. Here’s an example of how we can provide the nodeName in the podSpec:

## Adding NodeName in PodSpec
spec:
  nodeName: node-01

In the above example, the pod will be scheduled directly onto the node with the name node-01. But this way of scheduling is not preferred due to the following limitations:

  • If the named node does not exist, the pod will not be run.
  • If the named node does not have the resources to accommodate the pod, the pod will fail and its reason will indicate OutOfmemory or OutOfcpu.
  • Node names in cloud environments are not always predictable or stable.

Conclusion

After reading this blog, you should now understand how we can customize the scheduling of pods according to our use-case. If you still have any doubts or suggestions, you can contact me directly at yatharth.sharma@knoldus.com.

Also, I would like to thank you for sticking around to the end. If you liked this blog, please show your appreciation with a thumbs-up, share it, and let me know how I can improve my future posts to suit your needs. Follow me to get updates on different technologies.