Hello everyone, in this post I will try to explain how you can define affinities in your workloads so that the Kubernetes scheduler can apply them.
We can think of Kubernetes as the famous Tetris game: the bricks are Pods, and the playing area is the nodes' capacity. When a new brick arrives, it is like a new deployment being created, and Kubernetes needs to find the best placement for it.
Kubernetes has a built-in intelligence for deciding placements that is agnostic of the workload's characteristics. This "basic" intelligence optimizes cluster memory/CPU usage based on the Pods' requests/limits.
Affinity and anti-affinity come from the need to enrich this Kubernetes intelligence in order to meet workload needs, for example:
- I want (hard constraint) to run an AI-intensive Pod on a node with a GPU, because without a GPU it cannot run;
- I prefer (soft constraint) to run a Web Application Pod near the Redis cache Pod in order to reduce network latency;
- because I want to reach High Availability for my production project, I want to spread my core Pods across multiple nodes, so that if one node goes down my production project remains available.
These are only three examples of custom workload needs.
So, ladies and gentlemen, I have the honor of presenting the Kubernetes affinity tools that make all this possible...
- Node Name: the very dumb way (not suggested for serious environments) to schedule a Pod on a specific node using its name;
- Node Selector: the simplest way to schedule a Pod on a node with specific labels;
- Node Affinity: an upgrade of Node Selector;
- Pod Affinity and Anti-Affinity: like node affinity, but based on other Pods (with anti-affinity too);
- Taints and Tolerations: unlike the "affinity" concept, here you give power to the nodes.
Node Name
As I said above (and as the official Kubernetes guide says), nodeName is a (too) simple and discouraged way to tell Kubernetes: "Hey k8s, schedule this pod only on the node named mySuperNode".
If the named node does not exist, the Pod will not run, and in some cases it may be automatically deleted. Also, in cloud environments node names are not always predictable.
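Just for completeness, this is a minimal sketch of what a nodeName Pod looks like (the node name my-super-node is only a placeholder; node names must be lowercase DNS names):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: nginx-pinned
spec:
  nodeName: my-super-node   # placeholder: bypasses the scheduler entirely
  containers:
  - name: nginx
    image: nginx
```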
Node Selector
nodeSelector is the simplest recommended form of node selection constraint. With nodeSelector you can schedule your Pods only onto nodes that match all the nodeSelector constraints, so nodeSelector is a hard constraint.
Unlike nodeName, nodeSelector is based on node labels, which are the right variables to use in order to influence the Kubernetes scheduler.
Step 1 - Attach labels to a node / node pool
First of all, check your available nodes with
kubectl get nodes . Then attach a new label to a node:
kubectl label nodes <node-name> <label-key>=<label-value>
kubectl label nodes aks-node-001 gpu=enabled
Step 2 - Add nodeSelector to a Pod
apiVersion: v1
kind: Pod
metadata:
  name: nginx
  labels:
    env: test
spec:
  containers:
  - name: nginx
    image: nginx
    imagePullPolicy: IfNotPresent
  nodeSelector:
    gpu: enabled
Once deployed, this Pod will be scheduled only on nodes with the label gpu=enabled. nodeSelector accepts multiple constraints, which are ANDed together.
Built-in node labels
In addition to labels you attach yourself, nodes come pre-populated with a standard set of labels. See Well-Known Labels. These are very useful, for example, to schedule Pods to a certain zone/region or to a specific OS (some Pods may require Windows 🤯).
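As a sketch, here is a Pod that uses only well-known labels to require a Linux node in a specific zone (the zone value westeurope-1 is illustrative; check the values exposed by your own cluster):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: zone-pinned
spec:
  nodeSelector:
    kubernetes.io/os: linux                   # built-in OS label
    topology.kubernetes.io/zone: westeurope-1 # built-in zone label (example value)
  containers:
  - name: app
    image: nginx
```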
In some cloud environments, if you enable autoscaling (up and down) in a node pool where you have attached a label to a single node, Kubernetes may, during a scale-down, remove your labelled node instead of an unlabelled one. This can impact Pods that have a nodeSelector constraint.
For these cases I've found two solutions, depending on the use case:
- attach labels to the node pool instead of a single node; all nodes inside the node pool will inherit the node pool labels;
- enable autoscaling only on node pools that do not have specific labels.
Compared to nodeName, this is the simplest recommended way to schedule a Pod onto a specific node based on labels, and it's very simple to use.
However, constraints can only be hard: you can't specify a soft preference. You also can't build more complex expressions using AND/OR operators, nor expressions based on the status of other Pods.
The affinity concept comes to the aid of these weak points.
Node Affinity
Node affinity is conceptually similar to nodeSelector: it allows you to constrain which nodes your Pod is eligible to be scheduled on, based on labels on the node.
There are currently two types of node affinity:
requiredDuringSchedulingIgnoredDuringExecution --> hard constraint. It's like saying "Hey Kubernetes, schedule my Pod only on nodes that satisfy my conditions";
preferredDuringSchedulingIgnoredDuringExecution --> soft constraint. It's like saying "Hey Kubernetes, I would prefer to schedule my Pod on nodes that satisfy my conditions".
Also note that multiple "nodeSelectorTerms" are OR conditions, whereas the "matchExpressions" within a term are AND conditions.
Supported operators are: In, NotIn, Exists, DoesNotExist, Gt and Lt.
Node Affinity, as we said, is an extension of the nodeSelector tool. Its relevant features are:
- support for operators (In, NotIn, Exists, DoesNotExist, Gt, Lt);
- more complex expressions with AND (matchExpressions) and OR (nodeSelectorTerms) conditions;
- soft (preferredDuringSchedulingIgnoredDuringExecution) and hard (requiredDuringSchedulingIgnoredDuringExecution) conditions support.
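Putting the pieces together, here is a sketch that combines a hard GPU requirement (reusing the gpu=enabled label from the nodeSelector example) with a soft preference for a zone (the zone value and image are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-workload
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:            # multiple terms are ORed
        - matchExpressions:           # expressions in one term are ANDed
          - key: gpu
            operator: In
            values:
            - enabled
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 50                    # 1-100: higher weight, stronger preference
        preference:
          matchExpressions:
          - key: topology.kubernetes.io/zone
            operator: In
            values:
            - westeurope-1            # example zone value
  containers:
  - name: ai-app
    image: nginx                      # placeholder image
```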
Pod Affinity and Anti-Affinity
Inter-pod affinity and anti-affinity are a very nice and powerful feature. They derive from node affinity, but with these main differences:
- conditions are based on labels of Pods that are already running on the node;
- introduction of topologyKey, which is the key of the node label that the system uses to denote the topology domain.
Typical use cases are:
- co-locate Pods that communicate a lot with each other (pod affinity);
- spread Pods across different geo zones (pod anti-affinity).
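Both use cases can be sketched in a single manifest: a web Pod that prefers to land on the same node as the Redis cache, while its own replicas are hard-spread across zones. The app=redis and app=web labels are assumptions for this example:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: web
  labels:
    app: web
spec:
  affinity:
    podAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app: redis                          # assumed label on the cache Pod
          topologyKey: kubernetes.io/hostname     # "near" = same node
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: web                              # repel my own replicas
        topologyKey: topology.kubernetes.io/zone  # spread across zones
  containers:
  - name: web
    image: nginx
```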
Going deeper: symmetry
In order to understand the complexity behind pod affinity and anti-affinity, I suggest this reading about symmetry. Just an intro: "RequiredDuringScheduling anti-affinity is symmetric" means that if Pod A has a hard anti-affinity with Pod B, then Pod B also has an anti-affinity with Pod A.
Pod affinity and anti-affinity add significant complexity to the scheduler, because every time k8s needs to schedule a Pod with pod (anti-)affinity it must know/check the other Pods. When "other pods" become thousands of Pods, the scheduler can be slowed down.
Taint and Toleration
We have talked about affinity, which is a Pod's ability to choose the nodes it is scheduled on. Taints are the opposite: they allow a node to repel a set of Pods.
Tolerations are Pod properties whose aim is to match the node taints, so that the right Pods run on the right nodes.
So taints are applied to nodes and tolerations are applied to Pods; they work together to ensure that Pods are not scheduled onto inappropriate nodes.
This is the kubectl command to taint a node:
kubectl taint nodes node1 key1=value1:NoSchedule
(If you want to apply taints to a whole node pool, you have to use the provider command line, e.g. azure-cli, ibmcloud, aws-cli...)
The above example used NoSchedule. Alternatively, you can use PreferNoSchedule, which is a "preference" or "soft" version of NoSchedule: the system will try to avoid placing a Pod that does not tolerate the taint on the node, but it is not required to. The third kind is NoExecute, which means that Pods already running on the node that do not tolerate the taint will be evicted.
Once the taint is applied, only Pods that have a matching toleration for key1=value1:NoSchedule can be scheduled onto this node; but if Pods were already running on node1, they will keep running there (until they are re-scheduled).
Now, in order to schedule a new Pod onto node1, we need to apply (at least) this toleration:
tolerations:
- key: "key1"
  operator: "Equal"
  value: "value1"
  effect: "NoSchedule"
Typical use cases for taints and tolerations are:
- dedicated nodes;
- nodes with special hardware;
- node maintenance;
- Pod evictions (I suggest reading about taint-based evictions --> https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/#taint-based-evictions).
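For the taint-based evictions case, here is a sketch of a toleration that lets a Pod survive on an unreachable node for five minutes before being evicted (node.kubernetes.io/unreachable is a built-in NoExecute taint; the 300-second value is just an example):

```yaml
tolerations:
- key: "node.kubernetes.io/unreachable"
  operator: "Exists"
  effect: "NoExecute"
  tolerationSeconds: 300   # evict this Pod only after 300s of the taint being present
```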