
How Kubernetes affinity works

NodeAffinity, Pod Affinity, Taint/Toleration DEMYSTIFIED

Intro

Hello everyone, in this post I will try to explain how you can define affinities in your workloads so the Kubernetes scheduler can apply them.

We can think of Kubernetes as the famous Tetris game. Bricks are Pods and the Tetris playfield is the sum of the nodes' resources. When a new brick arrives, it is like a new Deployment has been created and Kubernetes needs to find the best place for it.

This is what the Kubernetes scheduler does!

Kubernetes has a built-in intelligence for deciding placements. This base intelligence optimizes cluster memory/CPU usage based on Pod requests/limits.

Affinity and anti-affinity come from the need to enrich this Kubernetes intelligence in order to meet workload needs, for example:

  • I want (hard constraint) to run an AI-intensive Pod on a node with a GPU, because without a GPU it cannot run;
  • I prefer (soft constraint) to run a Web Application Pod near the Redis cache Pod in order to reduce network latency;
  • because I want to reach High Availability for my production project, I want to spread my core Pods across multiple nodes, so that if one node goes down, my production project remains available.

These are only 3 examples of custom workload needs.

So, ladies and gentlemen, I have the honor to present the Kubernetes affinity tools that make it possible...

  • Node Name: the very naive (⚠️ not suggested for serious environments) way to schedule a Pod on a specific Node using its name;
  • Node Selector: the simplest way to schedule a Pod on a Node with specific labels;
  • Node Affinity: an upgrade of Node Selector;
  • Pod Affinity and Anti-Affinity: like node affinity, but based on other pods (with anti-affinity too);
  • Taint and Toleration: unlike the "Affinity" concept, here you give power to the nodes.

NodeName

As I said above (and as the official Kubernetes guide says), this is a (too) simple and discouraged way to tell Kubernetes: "Hey k8s, schedule this pod only on the node named mySuperIncredibleNode".

Cons

If the named node does not exist, the pod will not run, and in some cases may be automatically deleted. Also, in cloud environments node names are not always predictable.

Usage

apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  containers:
  - name: nginx
    image: nginx
  nodeName: kube-01
NodeName usage

Node Selector

nodeSelector is the simplest recommended form of node selection constraint. With nodeSelector your pods are scheduled only onto nodes that match all of the nodeSelector constraints, so nodeSelector is a hard constraint.

Node Selector, unlike nodeName, is based on node labels, which are the right variables to use to drive the Kubernetes scheduler.

Usage

Step 1 - Attach labels to a node / node pool

First of all, check your available nodes with kubectl get nodes. Then attach a new label to a node with kubectl label nodes <node-name> <label-key>=<label-value>

Example:

kubectl label nodes aks-node-001 gpu=enabled
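
To verify that the label has been applied, you can filter nodes by that label (aks-node-001 is just the example node name used above):

kubectl get nodes -l gpu=enabled --show-labels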

Step 2 - Add nodeSelector to a Pod

apiVersion: v1
kind: Pod
metadata:
  name: nginx
  labels:
    env: test
spec:
  containers:
  - name: nginx
    image: nginx
    imagePullPolicy: IfNotPresent
  nodeSelector:
    gpu: enabled

Once deployed, this pod will be scheduled only on nodes with the label gpu=enabled. nodeSelector accepts multiple constraints, which are combined with a logical AND.
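
For instance, a minimal sketch of a nodeSelector with two labels (disktype=ssd is a hypothetical second label); a pod with this selector would only be scheduled on nodes carrying both labels:

nodeSelector:
  gpu: enabled
  disktype: ssd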

⚠️ In a real, stable cloud scenario, I recommend attaching labels only to node groups, because single nodes/workers can be replaced at any time during maintenance/failures/upgrades/autoscaling/etc.

Built-in node labels

Kubernetes nodes come with a pre-populated set of standard labels. See Well-Known Labels. These are very useful, for example, to schedule pods to a certain Zone/Region or onto a specific OS (some pods may require Windows, or may need to be near a certain geographic zone).
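
For example, a minimal sketch of a nodeSelector based only on well-known labels (the zone value westeurope-1 is purely illustrative; use the zones actually present in your cluster):

nodeSelector:
  kubernetes.io/os: windows
  topology.kubernetes.io/zone: westeurope-1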

Pro

Compared to nodeName, this is the recommended way to schedule a pod onto specific nodes based on labels, and it is very simple to use.

Cons

Constraint expressions are only hard; you can't specify a soft preference. Also, you can't build more complex expressions using both AND and OR operators, and you can't create expressions based on the status of other pods.

The affinity concept comes to the aid of these weak points.

Node Affinity

Node affinity is conceptually similar to nodeSelector.
It allows you to constrain which nodes your pod is eligible to be scheduled on, based on labels on the node.

There are currently two types of node affinity:

  • requiredDuringSchedulingIgnoredDuringExecution --> hard constraint. It's like "Hey kubernetes, schedule my Pod only on nodes that satisfy my conditions"
  • preferredDuringSchedulingIgnoredDuringExecution --> soft constraint. It's like "Hey kubernetes, I prefer to schedule my Pod to nodes that satisfy my conditions"

I also have to say that multiple "nodeSelectorTerms" are combined with OR, while the "matchExpressions" inside a single term are combined with AND; a concrete sketch of this follows the usage example below.

Supported operators are: In, NotIn, Exists, DoesNotExist, Gt, Lt.

Usage

apiVersion: v1
kind: Pod
metadata:
  name: with-node-affinity
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.io/e2e-az-name
            operator: In
            values:
            - e2e-az1
            - e2e-az2
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        preference:
          matchExpressions:
          - key: another-node-label-key
            operator: In
            values:
            - another-node-label-value
  containers:
  - name: with-node-affinity
    image: k8s.gcr.io/pause:2.0
Ref. https://raw.githubusercontent.com/kubernetes/website/master/content/en/examples/pods/pod-with-node-affinity.yaml
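
To make the OR/AND semantics described above concrete, here is a minimal sketch (the disktype and gpu label keys are purely illustrative): a pod with this affinity can be scheduled on nodes that satisfy either of the two nodeSelectorTerms, while all matchExpressions inside a single term must match.

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      # term 1: the node must have disktype=ssd AND a gpu label (expressions are ANDed)
      - matchExpressions:
        - key: disktype
          operator: In
          values:
          - ssd
        - key: gpu
          operator: Exists
      # term 2, ORed with term 1: the node is in zone e2e-az3
      - matchExpressions:
        - key: kubernetes.io/e2e-az-name
          operator: In
          values:
          - e2e-az3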

Pro

Node Affinity, as we said, is an extension of the NodeSelector tool. The relevant features are:

  • many supported operators (In, NotIn, Exists, DoesNotExist, Gt, Lt), illustrated in the sketch after this list;
  • more complex expression with AND (matchExpressions) and OR (nodeSelectorTerms) conditions;
  • soft (preferredDuringSchedulingIgnoredDuringExecution) and hard (requiredDuringSchedulingIgnoredDuringExecution) conditions support.
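
As a small illustration of the Gt/Lt operators (gpu-count is a hypothetical numeric node label; the single value is interpreted as an integer), a matchExpressions entry could look like this:

- matchExpressions:
  # only nodes labeled with gpu-count greater than 4
  - key: gpu-count
    operator: Gt
    values:
    - "4"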

Pod Affinity and Anti-Affinity

Inter-pod affinity and anti-affinity are a very nice and powerful feature. They build on Node Affinity, with these main differences:

  1. Conditions are based on labels on pods that are already running on the node;
  2. Introduction of topologyKey, which is the key of the node label that the system uses to denote the topology domain.

Use Cases

  • Co-locate pods that communicate a lot with each other (pod affinity)
  • Spread pods in different geo zones (pod anti-affinity) for High Availability (HA)

Usage

apiVersion: v1
kind: Pod
metadata:
  name: with-pod-affinity
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: security
            operator: In
            values:
            - S1
        topologyKey: topology.kubernetes.io/zone
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: security
              operator: In
              values:
              - S2
          topologyKey: topology.kubernetes.io/zone
  containers:
  - name: with-pod-affinity
    image: k8s.gcr.io/pause:2.0
Ref. https://raw.githubusercontent.com/kubernetes/website/master/content/en/examples/pods/pod-with-pod-affinity.yaml
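
As a more concrete sketch of the High Availability use case above (the name web and the label app: web are purely illustrative), a Deployment whose replicas prefer to repel each other across zones could look like this:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      affinity:
        podAntiAffinity:
          # soft rule: try to spread the replicas over different zones
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchLabels:
                  app: web
              topologyKey: topology.kubernetes.io/zone
      containers:
      - name: web
        image: nginx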

Going deeper: symmetry

In order to understand the complexity behind pod affinity and anti-affinity I suggest this reading about symmetry. Just an intro: "RequiredDuringScheduling anti-affinity is symmetric" means that if pod A has a hard anti-affinity with pod B, then pod B is also treated as having an anti-affinity with pod A.

Why should you know this? Because it impacts Kubernetes scheduler performance!

Ref --> https://github.com/kubernetes/design-proposals-archive/blob/main/scheduling/podaffinity.md#a-comment-on-symmetry

Cons

Pod affinity and anti-affinity add significant complexity to the scheduler, because every time k8s needs to schedule a pod with pod (anti-)affinity it must check the other pods. When "other pods" becomes thousands of pods, the scheduler can be slowed down.

Taint and Toleration

We have talked about Affinity, which is the Pod's ability to choose the nodes it can be scheduled on. Taints are the opposite; they allow a node to repel a set of pods.

Tolerations are Pod properties whose aim is to match the node taints, in order to have the right pods running on the right nodes.

So Taints are applied to nodes and Tolerations are applied to pods; they work together to ensure that pods are not scheduled onto inappropriate nodes.

Usage

Taint nodes
This is the kubectl command to taint a node:

kubectl taint nodes node1 key1=value1:NoSchedule

⚠️ As suggested for node labels above, in a stable cloud environment taints should also be added to node groups.


The above example uses the NoSchedule effect. Alternatively, you can use the PreferNoSchedule effect, which is a "preference" or "soft" version of NoSchedule: the system will try to avoid placing a pod that does not tolerate the taint on the node, but it is not required. The third kind of effect is NoExecute, which means that pods already running on the node will be evicted if they do not tolerate the taint.

Once applied, only pods that have a toleration matching key1=value1:NoSchedule can be scheduled onto this node, but pods already running on node1 will continue to run there (until they are rescheduled).
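
For completeness, you can inspect the taints of a node and remove a taint by appending a minus sign to the same taint specification:

kubectl describe node node1 | grep Taints
kubectl taint nodes node1 key1=value1:NoSchedule-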

Pod Toleration
Now, in order to schedule a new pod onto node1, we need to apply (at least) this toleration:

tolerations:
- key: "key1"
  operator: "Equal"
  value: "value1"
  effect: "NoSchedule"

Use cases

Taint and Toleration example (GIF)
