- Kubernetes Cluster
- How Scheduling Works
- Labels and Selectors
- Taints and Tolerations
- Node Affinity
- Resource Requirements and Limits
- Static Pods
- Multiple Schedulers
- A set of nodes, which may be physical or virtual, on premises or in the cloud, that host applications in the form of containers
- Worker nodes host the applications as containers
- The master node manages, plans, schedules and monitors the nodes
- A database that stores information in key-value format
- It is a reliable key-value store that is simple, secure and fast
- You can download the etcd binary and run it using `./etcd`. It starts a service on port 2379 by default. You can attach clients to the service to store and retrieve data.
- The default client that ships with etcd is the etcd control client, `etcdctl`. We can store data using `./etcdctl set key1 value1` and retrieve it using `./etcdctl get key1`
- The ETCD datastore stores information about the cluster such as
  - Nodes, Pods, Configs, Secrets, Accounts, Roles, Bindings
- Every change we make to the cluster is updated in the etcd server
Installing the ETCD service
- Manual: when installing the cluster from scratch
  - Download the binary and install it on the master node yourself
  - `--advertise-client-urls https://${INTERNAL_IP}:2379`: the address on which etcd listens (default port 2379). This must be configured on the kube-api server so it can reach the etcd service.
- Using kubeadm
  - kubeadm deploys the etcd server for you as a pod in the kube-system namespace
- Kubernetes stores data in a specific directory structure in etcd. The root directory is `/registry`, and under that we have the various kubernetes constructs like minions (nodes), pods, replicasets, roles etc.
- In a highly available environment you will have multiple master nodes in the cluster, and with them multiple etcd instances spread across those masters. In that case make sure the etcd instances know about each other by setting the right parameter in the etcd service configuration:
  - `--initial-cluster controller-0=https://${CONTROLLER0_IP}:2380,controller-1=https://${CONTROLLER1_IP}:2380`
- ETCDCTL is the CLI tool used to interact with ETCD.
- ETCDCTL can interact with the ETCD server using two API versions - version 2 and version 3. By default it is set to use version 2, and each version has a different set of commands. For example, ETCDCTL version 2 supports the following commands:
  - `etcdctl backup`, `etcdctl cluster-health`, `etcdctl mk`, `etcdctl mkdir`, `etcdctl set`
- Whereas the commands are different in version 3:
  - `etcdctl snapshot save`, `etcdctl endpoint health`, `etcdctl get`, `etcdctl put`
- To set the right API version, set the ETCDCTL_API environment variable: `export ETCDCTL_API=3`
- Apart from that, you must also specify the paths to the certificate files so that ETCDCTL can authenticate to the ETCD API server. On the etcd master these are available at:
  - `--cacert /etc/kubernetes/pki/etcd/ca.crt --cert /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key`
- To get all the keys stored by kubernetes
kubectl exec etcd-master -n kube-system -- sh -c "ETCDCTL_API=3 etcdctl get / --prefix --keys-only --limit=10 --cacert /etc/kubernetes/pki/etcd/ca.crt --cert /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key"
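As an example of the version 3 commands listed above, a snapshot backup can be taken the same way (a sketch; the certificate paths assume a kubeadm-provisioned cluster and `/opt/snapshot.db` is an arbitrary example output file):

```bash
# Take an etcd snapshot from inside the etcd pod
kubectl exec etcd-master -n kube-system -- sh -c "ETCDCTL_API=3 etcdctl snapshot save /opt/snapshot.db \
  --cacert /etc/kubernetes/pki/etcd/ca.crt \
  --cert /etc/kubernetes/pki/etcd/server.crt \
  --key /etc/kubernetes/pki/etcd/server.key"
```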
- Identifies the right node to place the container on
- based on container's resource requirements, worker node capacity etc
- The scheduler continuously monitors the api-server. Whenever a new pod is created without a node assigned, it identifies an appropriate node for it and communicates that back to the kube-api server.
- It is only responsible for deciding which pod goes on which node. It does not actually place the pods on the nodes; that is done by the kubelet.
- It first filters out the nodes which cannot accommodate the pod's resource requests
- Then it runs functions which decide which node will be the best fit for the pod, and ranks the nodes based on this.
- You can write your own scheduler as well.
- How to install the kube-scheduler
  - Download the kube-scheduler binary from the kubernetes release page and install it as a service: `wget https://../kube-scheduler`
- Where you can view the kube-scheduler server options
  - If you set it up using the kubeadm tool, which deploys the kube-scheduler as a pod in the kube-system namespace on the master node, you can view the options at the following location: `cat /etc/kubernetes/manifests/kube-scheduler.yaml`
  - You can also see the running process by listing the processes on the master node and searching for the kube-scheduler process: `ps -aux | grep kube-scheduler`
- Takes care of nodes: responsible for onboarding new nodes to the cluster and handling situations when nodes become unavailable or get destroyed
- Ensures that the desired number of containers are running at any point in time in a replication group
- The replication controller helps us run multiple instances of a single pod in a kubernetes cluster, thus providing high availability
- Load balancing and scaling: the replication controller spans multiple nodes in a cluster and helps balance the load across multiple pods (on the same or different nodes)
- Creating an `rc-definition.yml`

```yaml
apiVersion: v1
kind: ReplicationController
metadata: # for the replication controller
  name: myapp-rc
  labels:
    app: myapp
    type: front-end
spec: # for the replication controller
  replicas: 3
  template: # define a pod template here
    metadata: # for the pod
      name: myapp-pod
      labels:
        app: myapp
        type: frontend
    spec: # for the pod
      containers:
        - name: nginx-container
          image: nginx
```
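To create the replication controller from this file (standard kubectl usage):

```bash
kubectl create -f rc-definition.yml
```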
- To view the replication controller resources: `kubectl get replicationcontroller`
- Now to view the pods created, you can use `kubectl get pods`
Similar to the replication controller
- The difference between a replicationController and a replicaSet is the `selector` definition. This helps the replicaSet identify which pods fall under it.
- It can also manage pods that were not created as part of the replicaset creation
- To create a `replicaset-definition.yml` (note that `selector.matchLabels` must match the labels on the pod template):

```yaml
apiVersion: apps/v1
kind: ReplicaSet
metadata:
  name: myapp-replicaset
  labels:
    app: myapp
    type: front-end
spec:
  replicas: 3
  selector:
    matchLabels:
      type: front-end
  template:
    metadata:
      name: myapp-pod
      labels:
        app: myapp
        type: front-end
    spec:
      containers:
        - name: nginx-container
          image: nginx
```
- To create the replicaset, run the following command: `kubectl create -f replicaset-definition.yml`
- Now to view the pods you can use `kubectl get pods`
- To increase the number of replicas you can use the scale command:
  - `kubectl scale --replicas=6 -f replicaset-definition.yml` or `kubectl scale --replicas=6 replicaset myapp-replicaset`
- To get the replicaset resources use `kubectl get replicaset`
- To delete the replicaset object: `kubectl delete replicaset myapp-replicaset`
- This is how the replicaset monitors `pods` with specific `matchLabels`
- Primary management component of kubernetes
- orchestrating all operations in the cluster
- it exposes the kubernetes api which is used by external users to perform management operations on the cluster
- When you run a kubectl (kube control) command, the kubectl utility reaches out to the kube-api server. The kube-api server then authenticates and validates the request. It then retrieves the data from the etcd cluster and responds back with the required information.
- We don't always need to use the kubectl command line; we can also invoke the APIs directly by sending HTTP requests.
- The kube-api server is available as a binary on the kubernetes release page. If not already present on the master node, you need to download and configure it on the master node yourself.
  - `kubectl get nodes`
  - `curl -X POST /api/v1/namespaces/default/pods/...[other]`
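A minimal sketch of calling the API directly, assuming you use `kubectl proxy` to handle authentication locally (8001 is the proxy's default port):

```bash
# Start a local proxy that authenticates to the kube-api server for you
kubectl proxy --port=8001 &

# List pods in the default namespace via the REST API
curl http://localhost:8001/api/v1/namespaces/default/pods
```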
- The kube-api server is responsible for
  - Authenticating users
  - Validating requests
  - Retrieving and updating data in the ETCD cluster
  - The scheduler uses the api server to perform updates in the cluster
  - The kubelet uses the api server to perform updates in the cluster
- Runtime arguments worth knowing
  - `--etcd-servers=https://127.0.0.1:2379`: how the kube-api server connects to the etcd server
- Viewing the kube-api server options in an existing cluster
  - If you deployed the cluster using kubeadm, which runs the kube-api server as a pod in the kube-system namespace: `kubectl get pods -n kube-system`
    - You can view the options at `cat /etc/kubernetes/manifests/kube-apiserver.yaml`
  - In a non-kubeadm setup you can view the options in the service file: `cat /etc/systemd/system/kube-apiserver.service`
  - You can also search for the kube-apiserver process on the master node and list the corresponding options: `ps -aux | grep kube-apiserver`
- We need software that can run the containers, i.e. a container runtime engine (e.g. docker)
- We need docker or an equivalent installed on all nodes of the cluster, including the master nodes
- kubernetes supports other runtime engines as well, like containerd
- It's an agent that runs on each node in the cluster
- It listens for instructions from the kube-api server and deploys or destroys containers as required
- The kube-api server periodically fetches status reports from the kubelet to monitor the status of the node and the containers on it
- The kubelet on a worker node registers the node with the kubernetes cluster. When it receives an instruction to load a container or a pod on the node, it requests the container runtime engine (like docker) to pull the required image and run an instance. It then continues to monitor the state of the pod and the containers in it and reports to the kube-api server on a timely basis.
- Installing the kubelet
  - Note that kubeadm does not automatically deploy the kubelet; you must download the binary, install it and run it as a service on each node: `wget https://../kubelet`
  - You can view the running process and its effective options by listing the processes on the worker node: `ps -aux | grep kubelet`
- Enables communication between the worker nodes
- Ensures that the necessary rules are in place on the worker nodes to allow the containers running on them to reach each other
- Within a kubernetes cluster every pod can reach every other pod. This is accomplished by deploying a pod networking solution to the cluster.
Pod Network
- It is an internal virtual network that spans all the nodes in the cluster and through which all the pods are connected.
- Say we have a web application deployed on one node and a database application deployed on another node. The web application can reach the database using the IP of the database pod, but there is no guarantee that this IP will remain the same. That is why we expose the database application using a service.
- The service does not join the pod network because the service is not an actual thing. It does not have a container like pods do, so it has no interface or actively listening process. It is a virtual component that lives only in kubernetes memory.
- kube-proxy is a process that runs on each node in the kubernetes cluster. Its job is to look for new services, and every time a new service is created it creates the appropriate rules on each node to forward traffic for that service to the backend pods.
- It creates iptables rules on each node in the cluster to forward traffic heading to the IP of the service.
- For example, it creates rules on each of the nodes saying that traffic trying to reach the IP of the service (1.2.3.6) should be forwarded to the pod IP (1.2.3.5).
- Installing kube-proxy
  - Download kube-proxy from the kubernetes release page, install it and run it as a service: `wget https://.../kube-proxy`
  - With kubeadm, kube-proxy is deployed as a daemonset and therefore runs on each node in the cluster.
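To confirm this on a kubeadm cluster (the daemonset is named `kube-proxy` in the kube-system namespace on standard kubeadm installs):

```bash
# The kube-proxy daemonset should show one pod per node
kubectl get daemonset kube-proxy -n kube-system
kubectl get pods -n kube-system -o wide | grep kube-proxy
```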
- Manages the various controllers in kubernetes
- A controller is a process which continuously monitors the state of various components within the system and works towards bringing the whole system to the desired functioning state.
- For example
  - The node controller is responsible for monitoring the status of the nodes and taking the necessary actions to keep the applications running. It does that through the kube-api server.
  - The node controller checks the status of the nodes every 5 seconds; this is how it monitors the health of the nodes.
  - If it stops receiving heartbeats from a node, the node is marked as unreachable, but only after waiting for 40 seconds. After a node is marked as unreachable it waits 5 minutes for the node to come back up. If it doesn't, it removes the pods assigned to that node and provisions them on healthy nodes, provided the pods are part of a replica set.
    - Node Monitor Period = 5s
    - Node Monitor Grace Period = 40s
    - Pod Eviction Timeout = 5m
```bash
$ kubectl get nodes
NAME       STATUS     ROLES    AGE   VERSION
worker-1   Ready      <none>   10d   v1.19.4
worker-2   NotReady   <none>   10d   v1.19.4
```
- The replication controller is responsible for monitoring the status of the replicasets and ensuring that the desired number of pods are always available within the set. If a pod dies it creates another one.
- In the same way there are many such controllers within kubernetes, like the deployment-controller, namespace-controller, job-controller etc.
- All these controllers are packaged into a single process known as the Kube-Controller-Manager.
- How to install the kube-controller-manager
  - Download the kube-controller-manager binary from the kubernetes release page: `wget https://../kube-controller-manager`
  - Extract it and run it as a service.
  - Options worth noting down: `--node-monitor-period=5s --node-monitor-grace-period=40s --pod-eviction-timeout=5m0s --controllers stringSlice Default: [*]`
    - The last option specifies which controllers to enable. By default all of them are enabled.
- So how do you view the kube-controller-manager's server options?
  - If installed using kubeadm: kubeadm deploys the kube-controller-manager as a pod in the kube-system namespace on the master node. You can see the options inside the pod at `cat /etc/kubernetes/manifests/kube-controller-manager.yaml`
  - In a non-kubeadm setup you can view the options at `cat /etc/systemd/system/kube-controller-manager.service`
  - You can also see the running process and the effective options by listing the processes on the master node and searching for kube-controller-manager: `ps -aux | grep kube-controller-manager`
- A `pod` is the smallest object that you can create in kubernetes
- Pods usually have a one-to-one relationship with containers when scaling your application. When you scale your app, you add more pods (not more containers in the same pod)
- A pod can have multiple containers as well
- The containers in a pod by default have access to the same storage and the same network namespace. They are created together and destroyed together.
- How to deploy a pod with an nginx image: `kubectl run nginx --image nginx`
- How to get the pods: `kubectl get pods`
- pod-definition.yml

```yaml
# The kubernetes api version we are using to create the object (string)
apiVersion: v1
# The type of object we are trying to create (string)
kind: Pod
metadata: # dictionary of objects
  name: myapp-pod
  labels:
    app: myapp
spec:
  containers: # list/array, as pods can have multiple containers
    - name: nginx-container # again a dictionary
      image: nginx
```

- Finally, create the object: `kubectl create -f pod-definition.yml`
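A handy way to generate a starting pod definition instead of writing it by hand (standard kubectl flags on recent versions):

```bash
# Print a pod manifest without creating the pod;
# redirect the output to a file (e.g. pod-definition.yml) to edit further
kubectl run nginx --image=nginx --dry-run=client -o yaml
```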
Gives us capabilities like
- For a webserver you need to deploy in a production environment, you need many such instances of the webserver running
- Whenever a newer version of the webapp becomes available in the docker registry, you would want to upgrade the webapp in all the instances
- While upgrading the instances you would like to upgrade them gradually (not all at once), i.e. a rolling update
- You would also like to be able to roll back changes that were recently carried out (see the rollout commands after the example below)
- So how can we create a deployment? The contents would be the same as those of a replicaset, except for the kind, which now becomes Deployment:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp-replicaset
  labels:
    app: myapp
    type: front-end
spec:
  replicas: 3
  selector:
    matchLabels:
      type: front-end
  template:
    metadata:
      name: myapp-pod
      labels:
        app: myapp
        type: front-end
    spec:
      containers:
        - name: nginx-container
          image: nginx
```
- You can create the deployment resource using `kubectl create -f deployment-definition.yaml`
- You can get the deployments using `kubectl get deployments`
- Note: The deployment automatically creates a replicaset, so if you run `kubectl get replicaset` you will see a replicaset whose name contains the name of the deployment. The replicaset in turn creates pods, so if you check the pods you will see them as well.
- To get `all` the objects created: `kubectl get all`
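As mentioned above, deployments support rolling updates and rollbacks. A sketch of the standard rollout commands (the deployment and container names follow the example above; `nginx:1.21` is just an example tag):

```bash
# Watch the progress of a rollout
kubectl rollout status deployment/myapp-replicaset

# View the rollout history
kubectl rollout history deployment/myapp-replicaset

# Update the image (triggers a rolling update), then roll back if needed
kubectl set image deployment/myapp-replicaset nginx-container=nginx:1.21
kubectl rollout undo deployment/myapp-replicaset
```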
- All the resources we create are created in the `default` namespace unless explicitly specified
- Kubernetes creates some pods and services for its internal purposes, such as those required by the networking solution and the DNS solution. To isolate these from the user they are placed in another namespace created at cluster startup, `kube-system`
- The third namespace created by kubernetes is called `kube-public`. This is where resources that should be available to all users are created.
- Each namespace can have its own set of policies defining who can do what, and you can also assign a quota of resources to each namespace
- The resources within a namespace can refer to each other simply by their names. For example, the `web-app` pod can reach the `db-service` simply by using the hostname `db-service`. If required, the `web-app` pod can reach a `db-service` in another namespace as well; for this it must append the namespace, i.e. `db-service.dev.svc.cluster.local`. We are able to do this because when the service is created, the DNS entry is automatically added in this format: db-service.dev.svc.cluster.local
  - Here
    - cluster.local: default domain name of the kubernetes cluster
    - svc: the subdomain for services
    - dev: the namespace
    - db-service: the name of the service
- Create a namespace: `kubectl create namespace dev`
- Or using a yaml file:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: dev
```
- By default our commands are executed in the `default` namespace. When we want to switch the namespace, we can use: `kubectl config set-context $(kubectl config current-context) --namespace=kube-system`
- To view pods in all namespaces, use: `kubectl get pods --all-namespaces`
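You can also target a single namespace per command without switching the context (standard kubectl flags; the `dev` namespace follows the example above):

```bash
# Run a command against a specific namespace
kubectl get pods -n dev

# Create a resource in a specific namespace
kubectl create -f pod-definition.yml --namespace=dev
```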
- To limit resources in a namespace, you need to create a resource of type ResourceQuota
- compute-quota.yaml

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-quota
  namespace: dev
spec:
  hard:
    pods: "10"
    requests.cpu: "4"
    requests.memory: 5Gi
    limits.cpu: "10"
    limits.memory: 10Gi
```

- `kubectl create -f compute-quota.yaml`
Kubernetes services enable communication between components within and outside of the application. For example, say we have a set of pods serving the frontend and a set of pods serving the backend. Kubernetes services enable the frontend pods to be available to end users, help communication between the backend and frontend pods, and help establish connectivity to external data sources.
Consider the following scenario
- Laptop has the IP address 192.168.1.10
- K8S node is on the same network and has the IP address 192.168.1.2
- Internal POD network is in the range 10.244.0.0
- POD IP is 10.244.0.2
- If you can SSH into the node, you can access the web app running in the POD:
  - `curl http://10.244.0.2` returns `Hello World!`
- But how do we access this application from the laptop? This is where the kubernetes service comes into play
- A Kubernetes Service is an object (like a Pod or a ReplicaSet). One of its use cases is to LISTEN on a PORT on the NODE and forward requests on that port to the PORT on the POD running the web application. This type of service is called a NodePort service, as it listens on a port on the node and forwards requests to the pod.
- Node Port : Where the service makes an internal POD accessible by a port on the node.
- Cluster IP : The service creates a virtual IP inside the cluster to enable communication between different services such as a set of frontend servers and a set of backend servers.
- Load Balancer : Where it provisions a load balancer for our application in supported cloud providers
service-definition.yaml

```yaml
apiVersion: v1
kind: Service
metadata:
  name: myapp-service
spec:
  type: NodePort
  ports: # Note this is an array and we can have multiple such definitions
    - targetPort: 80
      port: 80
      nodePort: 30008
  selector: # This tells the service which Pods to hit; these are the pod's labels (metadata.labels in the pod definition)
    app: myapp
    type: front-end
```
- To create the service: `kubectl create -f service-definition.yaml`
- To view the service:

```bash
$ kubectl get svc
NAME            TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)        AGE
kubernetes      ClusterIP   10.96.0.1        <none>        443/TCP        79d
myapp-service   NodePort    10.106.123.123   <none>        80:30008/TCP   5m
```
Now if the IP of the node is 192.168.1.2, you can access the application via `curl http://192.168.1.2:30008`
- Also note that the Service does load balancing: the same service balances the load between the many pods whose `labels` match the `selector` on the service. The same is true even when the pods span multiple nodes; the service spans all the nodes and balances the load between the pods. You can access the web application using the `IP:PORT` of any of the nodes (when the service is of type NodePort).
Consider the following example
- Helps us group the pods together and provides a single interface to access the pods in the group.
- A service created for the backend pods allows the frontend pods to access the backend pods randomly
- Similarly, the service created for the redis pods allows the backend pods to access the redis pods randomly
- So each layer (backend/frontend) can scale independently of the others without impacting the communication between the services
- Each service gets an IP and a name assigned to it inside the cluster, and that name is what other pods should use to access the service
- This type of service is called ClusterIP
service-definition.yaml

```yaml
apiVersion: v1
kind: Service
metadata:
  name: back-end
spec:
  type: ClusterIP # default
  ports:
    - port: 80
      targetPort: 80
  selector:
    app: myapp
    type: back-end
```
- To create the service: `kubectl create -f service-definition.yaml`
- To view the service: `kubectl get services`. Other pods can access the service using its ClusterIP or its service name.
- We can access the applications using the IPs of the nodes. However, since there are many nodes and many apps (say 4 nodes and two applications), we would need to remember many IP:port combinations, and we would need to configure our own load balancer to forward requests to all the nodes.
- The LoadBalancer service type does this job for us. It provisions a load balancer with a public IP which we can map to a domain name, so users can access all the applications using the same domain name.
```yaml
apiVersion: v1
kind: Service
metadata:
  name: myapp-service
spec:
  type: LoadBalancer
  ports:
    - targetPort: 80
      port: 80
      nodePort: 30008
```
It runs one copy of your pod on each node in your cluster. Whenever a new node is added to the cluster, a replica of the pod is automatically added to that node. When a node is removed, the pod is automatically removed.
Use Cases
- Monitoring solutions
- Logging solutions
- kube-proxy can be deployed as a daemonset as well
- Networking solutions like `weave-net` require an agent to be deployed on each node in the cluster
daemonset-resource.yaml

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluentd-elasticsearch
spec:
  selector:
    matchLabels:
      name: fluentd-elasticsearch
  template:
    metadata:
      labels:
        name: fluentd-elasticsearch
    spec:
      containers:
        - name: fluentd-elasticsearch
          image: quay.io/fluentd_elasticsearch/fluentd:v2.5.2
```
- To create a daemonset you can run: `kubectl apply -f daemonset-resource.yaml`
- To get the daemonsets: `kubectl get daemonset`
- To get more detailed information: `kubectl describe daemonset monitoring-daemon`
- So how does it work?
- The daemonset uses node affinity rules and the default scheduler to schedule its pods on the nodes.
- Every pod has a field called `nodeName` that by default is not set; kubernetes adds it automatically. The scheduler goes through all the pods and looks for those which do not have this property set - those are the candidates for scheduling. It then identifies the right node for the pod by running its scheduling algorithm. Once identified, it schedules the pod on the node by setting the `nodeName` property of the pod to the name of the node.
- If there is no scheduler, the pods remain in the Pending state. You can also manually assign pods to nodes yourself by setting the `nodeName` property to the name of the node. You can only specify the nodeName at pod creation time; it will not work on an already existing pod, so you might need to delete the pod and create it again with the nodeName set:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  containers:
    - name: nginx
      image: nginx
  nodeName: kube-01
```
If you want to assign a node to an existing pod without deleting it, create a pod Binding object:

```yaml
apiVersion: v1
kind: Binding
metadata:
  name: nginx
target:
  apiVersion: v1
  kind: Node
  name: node2
```
And send a POST request to the pod-binding api with the above data.
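A minimal sketch of that POST request, assuming the api server is reachable at `$SERVER`, authentication is already handled (e.g. via kubectl proxy), and the pod lives in the default namespace; the Binding object above is sent as JSON:

```bash
# POST the Binding (converted to JSON) to the pod's binding subresource
curl --header "Content-Type: application/json" --request POST \
  --data '{"apiVersion":"v1","kind":"Binding","metadata":{"name":"nginx"},"target":{"apiVersion":"v1","kind":"Node","name":"node2"}}' \
  http://$SERVER/api/v1/namespaces/default/pods/nginx/binding/
```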
- You can group and select objects using labels and selectors
- You can attach labels to each object as per your needs
- How to specify labels to filter the objects, in the pod-definition.yaml file:

```yaml
metadata:
  name: app
  labels:
    app: App1
    function: Front-end
```

- We can add as many labels as we like. You can also view the pods with a given label: `kubectl get pods --selector app=App1`
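Multiple labels can be combined in one selector; a small example using the labels assumed above:

```bash
# Select only the pods that carry both labels
kubectl get pods --selector app=App1,function=Front-end
```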
- Similarly, labels are used in `replica-set.yaml` to group the selected pods:

```yaml
spec:
  replicas: 3
  selector:
    matchLabels:
      app: App1
      function: Front-end
```
- Similarly, a `service.yaml` uses its selector to match the labels on the pods:

```yaml
spec:
  selector:
    app: App1
```
Annotations, on the other hand, are used to record other details for informational purposes, for example a build version:

```yaml
metadata:
  name: simple-webapp
  labels:
    app: App1
    function: Front-end
  annotations:
    buildversion: "1.34"
```
Taints and tolerations are used to set restrictions on what pods can be scheduled on a node.
`Taints` are set on nodes and `tolerations` are set on pods. The `taint-effect` defines what happens to pods that do not tolerate the taint. There are three taint effects:
- NoSchedule: the pods will not be scheduled on the node
- PreferNoSchedule: the system will try to avoid placing a pod on the node, but that is not guaranteed
- NoExecute: no new pods will be scheduled on the node, and existing pods on the node, if any, will be evicted if they do not tolerate the taint

`kubectl taint nodes <node-name> <key>=<value>:[NoSchedule|PreferNoSchedule|NoExecute]`
`kubectl taint nodes node1 app=blue:NoSchedule`
Now how can we add the tolerations to the pods? Note that each toleration is a single list item with `key`, `operator`, `value` and `effect` fields, and the values must be quoted:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  containers:
    - name: nginx-container
      image: nginx
  tolerations:
    - key: "app"
      operator: "Equal"
      value: "blue"
      effect: "NoSchedule"
```
When new pods are created without a matching toleration, they are either not scheduled on the tainted node or evicted from it, depending on the `effect` that is set.
Note:
- NoExecute ensures that a pod with a toleration for a given taint will be accepted on the node with that taint, but the pod can still be scheduled on other nodes as well. It also ensures that any other pods that do not tolerate the taint and had already been scheduled before are removed from the node.
- Also note that the scheduler does not schedule any pods on the master node; this is because a taint is already created on the master node at cluster creation.
You can check the taints using the following command:
$ kubectl describe node docker-desktop | grep Taints
Taints: <none>
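A taint can be removed with the same command plus a trailing minus sign (standard kubectl syntax; the node and key names follow the earlier example):

```bash
# Remove the taint that was added above
kubectl taint nodes node1 app=blue:NoSchedule-

# Example: allow scheduling on a master node by removing its default taint
# (newer clusters use the node-role.kubernetes.io/control-plane key instead)
kubectl taint nodes <master-node-name> node-role.kubernetes.io/master:NoSchedule-
```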
This is used to limit a pod to a particular node:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: myapp-pod
spec:
  containers:
    - name: data-processor
      image: data-processor
  nodeSelector:
    size: Large
```
This also requires labelling the nodes correspondingly. This ensures that the pod is placed on the right node.
kubectl label nodes <node-name> <label-key>=<label-value>
kubectl label nodes node-1 size=Large
- The primary feature of node affinity is to ensure that pods are hosted on particular nodes. For example, we want the large data-processing pod to end up on node1:
```yaml
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: size
              operator: In
              values:
                - Large
                - Medium
```
This says that the pod should be placed on any node whose `size` label has one of the values in the list specified here. In this case we are using the `In` operator; we can also use `Exists`, `NotIn` and other operators.
The type of node affinity defines the behaviour of the scheduler with respect to node affinity at different stages of the pod's lifecycle. Two types are available now:
- requiredDuringSchedulingIgnoredDuringExecution
- preferredDuringSchedulingIgnoredDuringExecution

The following are the two states in the lifecycle of a pod considered by node affinity:
- DuringScheduling
  - When the pod does not exist and is created for the first time
  - If the type is `Required`, the pod has to be placed on a node matching the given affinity; if no such node is present, the pod is not scheduled
  - If the type is `Preferred`, and a node with the given affinity is not found, the scheduler simply places the pod on any other available node
- DuringExecution
  - The pod is already running and a change is made in the environment that affects node affinity, such as a change to a node's label
  - If the type is `Ignored`, changes in node affinity do not impact pods once they are scheduled
  - If the type were `Required` here, existing pods would be evicted on any change to a node's affinity, i.e. if the `Large` label of a node is removed, pods requiring that label would also be evicted from it
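For comparison, a minimal sketch of the `preferred` variant, which needs a `weight` (1-100) per preference; the `size` label follows the example above:

```yaml
affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1          # higher weight = stronger preference when scoring nodes
        preference:
          matchExpressions:
            - key: size
              operator: In
              values:
                - Large
```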
When a pod is placed on a node. It consumes resources available to that node. It is the kubernetes-scheduler that decides which node a pod goes to. It takes into consideration,
- amount of resources required by a pod
- amount of resources available on the node
The following tables show the amount of resources that a container in a pod requests in order to be scheduled, against the capacity of a node:

POD (requested)

| CPU | Memory | Disk |
|---|---|---|
| 0.5 | 256 Mi | 2 |

NODE (available)

| CPU | Memory | Disk |
|---|---|---|
| 10 | 10 | 10 |
You can also set the default values for a namespace by using a LimitRange:

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: mem-limit-range
spec:
  limits:
    - default:
        memory: 512Mi
      defaultRequest:
        memory: 256Mi
      type: Container
```
NOTE: 1 CPU = 1000 m = 1 AWS vCPU = 1 GCP Core = 1 Azure Core
NOTE: 256 Mi = 256 Mebibyte
NOTE: 1 Gi = 1 Gibibyte
Note that in the docker world a container has no limit on the resources it can consume on a node. Say a container starts with 1 vCPU on a node; it can go on to consume as many resources as it needs, starving the native processes on the node or other containers of resources.
If you do not specify them explicitly, Kubernetes applies a default limit of 1 vCPU and 512 Mi of memory per container (these defaults come from a LimitRange such as the one above).
You can set these values in the pod definition file:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: frontend
spec:
  containers:
    - name: app
      image: images.my-company.example/app:v4
      resources:
        requests:
          memory: "64Mi"  # minimum memory for the container
          cpu: "250m"     # minimum cpu for the container
        limits:
          memory: "128Mi" # maximum memory for the container
          cpu: "500m"     # maximum cpu for the container
```
What happens if a pod tries to go beyond its resource limits?
- For CPU, kubernetes throttles the container so it cannot use more CPU than it has been assigned
- For memory, a container can temporarily use more than its limit, but if it does so constantly the pod will be terminated
The kubelet can manage a node independently as well.
On the host we have the `kubelet` as well as `docker` installed to run containers.
Suppose there is no kubernetes cluster (so there is no kube-api-server). The one thing the kubelet knows how to do is create pods, but we don't have the kube-api-server here to provide the pod details.
To create a pod we need its details in a pod definition file. But how do you provide that file to the kubelet without an api server?
- You can configure the `kubelet` to read pod definition files from a directory on the server designated to store information about pods, and place the pod definition files in this directory
- The kubelet periodically checks this directory for files, reads them and creates the pods on the host
- It also ensures that the pods stay alive; if the application crashes, the kubelet attempts to restart it
- If you make a change to any of the files in this directory, the kubelet recreates the pod for the change to take effect
- If you remove a file from this directory, the pod is automatically deleted
- These pods, created by the kubelet on its own without the intervention of the api server or the rest of the kubernetes components, are known as static pods. Remember that you can only create pods this way, not other kubernetes resources.
So what is that designated folder and how do you configure it?
The path can be any directory on the host, and its location is passed to the kubelet as an option when running the service. The option is in the `kubelet.service` file:

```
ExecStart=/usr/local/bin/kubelet \
  --pod-manifest-path=/etc/kubernetes/manifests
```

Alternatively, you can use the `--config` option to point to a config file:

```
ExecStart=/usr/local/bin/kubelet \
  --config=kubeconfig.yaml
```

kubeconfig.yaml:

```yaml
staticPodPath: /etc/kubernetes/manifests
```
Once the static pods are created you can view them using `docker ps`.
The way the kubelet works is that it can take requests for creating pods from different inputs:
- the first is through pod definition files in the static pods folder
- the second is through an HTTP endpoint, which is how the kube-api-server provides input to the kubelet

The kubelet can create both kinds of pods - the static pods and the pods requested by the kube-api-server - at the same time. The kube-api-server is also aware of the static pods created by the kubelet, so if you run:

```bash
$ kubectl get pods
NAME                      READY   STATUS    RESTARTS   AGE
static-web-node-0123455   1/1     Running   1          6d20h
```

the static pod is also listed. What you see in the list is actually a mirror object: you cannot edit or delete it like the usual pods. You can only delete static pods by removing their files from the manifests folder.
Why do we need static pods? Since static pods are not dependent on the kubernetes control plane (i.e. the kube-api-server, etcd, etc.), you can use static pods to deploy the control plane components themselves as pods on a node:
- Start by installing the kubelet on all the master nodes
- Then create pod definition files that use the docker images of the various control plane components, such as `api-server.yaml`, `controller-manager.yaml`, `etcd.yaml`
- Then place these definition files in the designated manifests folder. The kubelet takes care of deploying the control plane components as pods on the cluster. That's how the kubeadm tool sets up a kubernetes cluster.
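On a kubeadm-provisioned cluster you can see this in practice; the manifest directory and file names below are the typical kubeadm defaults:

```bash
# Static pod manifests for the control plane on a kubeadm master node
ls /etc/kubernetes/manifests/
# etcd.yaml  kube-apiserver.yaml  kube-controller-manager.yaml  kube-scheduler.yaml

# Static pods appear in the kube-system namespace with the node name appended to the pod name
kubectl get pods -n kube-system
```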
When you want your own scheduling algorithm to place pods on nodes, so that you can add your own custom conditions and checks, you can write your own kubernetes scheduler program, package it and deploy it either as the default scheduler or as an additional scheduler in the cluster.
When deployed as an additional scheduler, all other applications can be scheduled by the default scheduler while your specific application uses the custom scheduler.
While creating a pod or a deployment you can instruct kubernetes to have the pod scheduled by a specific scheduler.
To deploy an additional scheduler you can use the same kube-scheduler binary (or a custom one) and set the scheduler name parameter to a custom name such as my-custom-scheduler:
```
ExecStart=/usr/local/bin/kube-scheduler \
  --config=/etc/kubernetes/config/kube-scheduler.yaml \
  --scheduler-name=my-custom-scheduler
```
There is one more option you should be aware of: `--leader-elect=[true|false]`.
This is used when you have multiple copies of the scheduler running on different master nodes; only one can be active at a time, and this option is used to elect a leader which will lead the scheduling activity.
To get multiple schedulers working, set this option to false if you don't have multiple masters. If you do have multiple masters, you can pass an additional parameter `--lock-object-name=my-custom-scheduler`; this differentiates the custom scheduler from the default one during the leader election process.
Now, to configure a pod to use the new `my-custom-scheduler`:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  containers:
    - name: nginx-container
      image: nginx
  schedulerName: my-custom-scheduler
```
This way when the pod is created the right scheduler picks it up for scheduling. You can check if your pod was scheduled by the right scheduler by:
kubectl get events
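To dig deeper, check the Scheduled event's source and look at the custom scheduler's logs (the pod name my-custom-scheduler and the kube-system namespace are assumptions based on how the scheduler was deployed in your cluster):

```bash
# The SOURCE column of the Scheduled event shows which scheduler bound the pod
kubectl get events -o wide

# Inspect the custom scheduler's logs if something goes wrong
kubectl logs my-custom-scheduler -n kube-system
```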