diff --git a/4.12-4.14/CNF-Upgrade-Prep.html b/4.12-4.14/CNF-Upgrade-Prep.html
index 06553d7..285a91a 100644
--- a/4.12-4.14/CNF-Upgrade-Prep.html
+++ b/4.12-4.14/CNF-Upgrade-Prep.html
@@ -172,7 +172,7 @@
-Before you go any further, please read through the CNF requirements document.
+Before you go any further, please read through the CNF requirements document. In this section a few of the most important points are discussed, but the CNF Requirements Document has additional detail and covers other important topics.
OpenShift and Kubernetes have some built-in features that not everyone takes advantage of, called
-liveness and readiness probes.
+liveness, readiness and startup probes.
These are very important when pod deployments depend on maintaining state for their application. This document won’t go into detail regarding these probes, but please review the OpenShift documentation on how to implement them.
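For reference, here is a minimal sketch of what declaring all three probes in a pod spec might look like. The pod name, container image, port, and endpoint paths are hypothetical placeholders, not values taken from this document:

apiVersion: v1
kind: Pod
metadata:
  name: cnf-example                               # hypothetical pod name
spec:
  containers:
  - name: cnf-app                                 # hypothetical container name
    image: registry.example.com/cnf-app:latest    # placeholder image
    ports:
    - containerPort: 8080
    startupProbe:                 # gives a slow-starting application time before the other probes begin
      httpGet:
        path: /healthz            # assumed health endpoint
        port: 8080
      failureThreshold: 30
      periodSeconds: 10
    livenessProbe:                # restarts the container if it stops responding
      httpGet:
        path: /healthz
        port: 8080
      periodSeconds: 10
    readinessProbe:               # keeps the pod out of service endpoints until it reports ready
      httpGet:
        path: /ready              # assumed readiness endpoint
        port: 8080
      periodSeconds: 5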
diff --git a/4.12-4.14/OCP-upgrade-prep.html b/4.12-4.14/OCP-upgrade-prep.html
index e59ae9b..3111c40 100644
--- a/4.12-4.14/OCP-upgrade-prep.html
+++ b/4.12-4.14/OCP-upgrade-prep.html
@@ -147,13 +147,18 @@
Firmware is NOT part of the OpenShift upgrade and cannot be controlled through OpenShift. However, in telco we run most of our clusters on bare metal, which means that we need to be very mindful of the firmware level on all of our hardware.
+All hardware vendors will advise that it is always best to be on their latest certified version of firmware for their
hardware. In the telco world this comes with a trust-but-verify approach due to the high-throughput nature of telco
CNFs. Therefore, it is important to have a dedicated team that can test the current and the latest firmware from any vendor
to make sure that all components will work with both. Upgrading firmware in conjunction with an OCP upgrade is not always
recommended; however, testing the latest firmware release ahead of time will improve the odds that
-you won’t run into issues down the road.
-Upgrading firmware is a very important debate because the process can be very intrusive and has a potential for causing
+you won’t run into issues down the road.
Upgrading firmware is an important decision because the process can be very intrusive and has the potential to leave a node requiring manual intervention before it comes back online. On the other hand, it may be imperative to upgrade the firmware due to security fixes, newly required functionality, or compatibility with the new release of OCP components. Therefore, it is up to each team to check with their hardware vendors and verify compatibility with OCP.
@@ -179,19 +184,66 @@
There is a set of Red Hat Operators that are NOT part of the cluster operators; these are otherwise known as the OLM-installed Operators. To determine the compatibility of these OLM-installed Operators, there is a web-based tool that shows which versions of OCP are compatible with specific releases of an Operator. This tool tells you whether you need to upgrade an Operator after each Y-Stream upgrade or whether you can wait until you have fully upgraded to the next EUS release.
+In Step 9 under the “Upgrade Process Flow” section you will find additional information regarding what you need to do if an Operator needs to be upgraded after performing the first Y-Stream Control Plane upgrade.
+Note: Some Operators are compatible with several releases of OCP, so you may not need to upgrade them until you have completed the cluster upgrade. This is shown in Step 13 of the Upgrade Process Flow.
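If you want a quick inventory of the OLM-installed Operators on a cluster to check against that compatibility tool, something like the following can be used:

oc get csv -A
oc get subscriptions -A

The first command lists the installed ClusterServiceVersions (and their versions) in all namespaces; the second lists the Operator Subscriptions and the update channels they track.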
Prepare your Machine Config Pool (MCP) labels by grouping your nodes, depending on the number of nodes in your cluster.
-MCPs should be split up into 8 to 10 nodes per group. However, there is no hard fast rule as to how many nodes need to
-be in each MCP. The purpose of these MCPs is to group nodes together so that a group of nodes can be controlled
-independently of the rest. Additional information and examples can be found here, under the canary rollout documentation.
-These MCPs will be used to un-pause a set of nodes during the upgrade process, thus allowing them to be upgraded and
-rebooted at a determined time instead of at the pleasure of the scheduler. Please review the upgrade process flow
-section, below, for more details on the pause/un-pause process.
Prepare your Machine Config Pool (MCP) labels by grouping your nodes, depending on the number of nodes in your cluster. MCPs should be split up into groups of 8 to 10 nodes. However, there is no hard and fast rule as to how many nodes need to be in each MCP. The purpose of these MCPs is to group nodes together so that the upgrade and reboot of a group of nodes can be controlled independently of the rest. Additional information and examples can be found here, under the EUS to EUS upgrade documentation.
+These MCPs will be used to un-pause a set of nodes during the upgrade process, thus allowing them to be upgraded and rebooted at a determined time instead of at the pleasure of the scheduler. Please review the upgrade process flow section, below, for more details on the pause/un-pause process.
+During an upgrade there is always a chance that there will be a problem. Most often the problem is related to hardware failing or needing to be reset. If a problem occurs, having a set of MCPs in a paused state allows the cluster administrator to make sure there are enough nodes running at all times to keep all applications up. The most important thing is to make sure there are enough resources for all application pods.
+How many MCPs you need can vary depending on how many nodes are in the cluster and how many nodes are in each node role. By default the 2 roles in a cluster are master and worker. However, in telco clusters we quite often split the worker nodes into 2 separate roles: control plane and data plane. The key point is to add MCPs that split out the nodes in each of these two groups.
+For example:
+We have 15 worker nodes in a cluster
+10 worker nodes are in the control plane role
+5 worker nodes are in the data plane role
In this cluster you should split both the CNF control plane and data plane worker node roles into at least 2 MCPs.
+The advantage of having 2 MCPs per role is that one set of nodes is not affected by the upgrade if you need a stopping point.
A larger cluster might have as many as 100 worker nodes in the control plane role. In that case you would want to keep each MCP to around 10 nodes. This allows you either to unpause multiple MCPs at a time to speed up the upgrade, or to spread the upgrade over at least 2 maintenance windows.
+Here is a depiction of what this would look like:
The division and size of these MCPs can vary depending on many factors. In general the standard division is between 8
-and 10 nodes per MCP to allow the operations team to control how many nodes are taken down at a time.
+The division and size of these MCPs can vary depending on many factors. In general the standard division in large clusters is between 8 and 10 nodes per MCP to allow the operations team to control how many nodes are taken down at a time.
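For a cluster of that size, labeling nodes into MCPs one at a time is tedious. As a rough sketch only (the mcp-N label names and the group size of 10 are assumptions chosen to match the examples in this section, not a required convention), the worker nodes could be labeled in groups with a small shell loop:

i=0
for node in $(oc get nodes -l node-role.kubernetes.io/worker= -o name); do
  pool=$(( i / 10 + 1 ))                              # first 10 nodes -> mcp-1, next 10 -> mcp-2, ...
  oc label "$node" "node-role.kubernetes.io/mcp-${pool}="
  i=$(( i + 1 ))
done

The matching MachineConfigPool objects still need to be created for each label, as shown later in this section.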
+Here is another example, this time with a medium-sized cluster.
+There are a total of 6 nodes in this role, and the role is split into MCPs of 2 nodes each.
+The smaller MCPs allow 2 nodes at a time to be upgraded; those nodes can then run for a day so that CNF compatibility can be verified before the upgrade is completed on the other 4 nodes.
In larger clusters there is quite often a need to separate out several nodes for purposes like Load Balancing or other
-high throughput purposes, which usually have different machine sets to configure SR-IOV. In these cases we do not want
-to upgrade all of these nodes without getting a chance to test during the upgrade. Therefore, we need to separate them
-out into at least 3 different MCPs and unpause them individually.
+Smaller cluster example with 1 rack
+The process and pace at which you un-pause the MCPs is determined by your CNFs and their configuration. Please review the sections on PDB and anti-affinity for CNFs. If your CNFs can properly handle scheduling within an OpenShift cluster, you can un-pause several MCPs at a time and set maxUnavailable to as high as 50%. This allows as many as half of the nodes in an MCP to restart and upgrade, which reduces the amount of time needed for a specific maintenance window and allows your cluster to upgrade quickly.
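As a sketch of what the pause/un-pause step looks like in practice (using the mcp-1 pool name from the example below; the 50% value is only an illustration), the pools are paused before the upgrade starts and patched during the maintenance window:

Pause the pool so its nodes hold their current MachineConfig while the rest of the upgrade proceeds:
oc patch mcp/mcp-1 --type merge --patch '{"spec":{"paused":true}}'

Optionally allow up to half of the pool's nodes to update at the same time:
oc patch mcp/mcp-1 --type merge --patch '{"spec":{"maxUnavailable":"50%"}}'

Un-pause the pool when you are ready for its nodes to upgrade and reboot:
oc patch mcp/mcp-1 --type merge --patch '{"spec":{"paused":false}}'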
+First you can run “oc get mcp” to show your current list of MCPs:
+# oc get mcp
+NAME     CONFIG                  UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
+master   rendered-master-b…e83   True      False      False      3              3                   3                     0                      25d
+worker   rendered-worker-2…c4f   True      False      False      2              2                   2                     0                      25d
The process and pace at which you un-pause the MCPs is determined by your CNFs and their configuration. Please review
-the sections on PDB and anti-affinity for CNFs. If your CNF can properly handle scheduling within an OpenShift cluster
-you can un-pause several MCPs at a time and set the MaxUnavailable to as high as 50%. This will allow as many as half
-of the nodes in your MCPs to restart and upgrade. This will reduce the amount of time that is needed for a specific
-maintenance window and allow your cluster to upgrade quickly. Hopefully you can see how keeping your PDB and
-anti-affinity correctly configured will help in the long run.
+Now list out all of the nodes in your cluster:
+Note: Review what is listed in the “ROLES” column; this will be updated as we move through this process.
# oc get no
+NAME STATUS ROLES AGE VERSION
+euschannel-ctlplane-0.test.corp Ready master 25d v1.23.12+8a6bfe4
+euschannel-ctlplane-1.test.corp Ready master 25d v1.23.12+8a6bfe4
+euschannel-ctlplane-2.test.corp Ready master 25d v1.23.12+8a6bfe4
+euschannel-worker-0.test.corp Ready worker 25d v1.23.12+8a6bfe4
+euschannel-worker-1.test.corp Ready worker 25d v1.23.12+8a6bfe4
+Determine, from the above suggestions, how you would like to separate your worker nodes into machine config pools (MCPs).
+This is a 2-step process:
+We add an MCP label to each node
+We apply MCPs to the cluster, which will organize the nodes based on their labels
+Note: In the following example there are only 2 nodes and 2 MCPs; therefore, each MCP contains only 1 node.
We first need to label the nodes so that they can be put into MCPs. We will do this with the following commands:
+oc label node euschannel-worker-0.test.corp node-role.kubernetes.io/mcp-1=
+oc label node euschannel-worker-1.test.corp node-role.kubernetes.io/mcp-2=
+Note: The labels will show up when you run the “oc get node” command:
# oc get no
+NAME STATUS ROLES AGE VERSION
+euschannel-ctlplane-0.test.corp Ready master 25d v1.23.12+8a6bfe4
+euschannel-ctlplane-1.test.corp Ready master 25d v1.23.12+8a6bfe4
+euschannel-ctlplane-2.test.corp Ready master 25d v1.23.12+8a6bfe4
+euschannel-worker-0.test.corp Ready mcp-1,worker 25d v1.23.12+8a6bfe4
+euschannel-worker-1.test.corp Ready mcp-2,worker 25d v1.23.12+8a6bfe4
+Now you need to create YAML files that define the MCPs which select those labels. Each MCP needs either a separate file or a separate section within one file (as shown below).
+Here is one example:
+---
+apiVersion: machineconfiguration.openshift.io/v1
+kind: MachineConfigPool
+metadata:
+ name: mcp-2
+spec:
+ machineConfigSelector:
+ matchExpressions:
+ - {
+ key: machineconfiguration.openshift.io/role,
+ operator: In,
+ values: [worker,mcp-2]
+ }
+ nodeSelector:
+ matchLabels:
+ node-role.kubernetes.io/mcp-2: ""
+---
+apiVersion: machineconfiguration.openshift.io/v1
+kind: MachineConfigPool
+metadata:
+ name: mcp-1
+spec:
+ machineConfigSelector:
+ matchExpressions:
+ - {
+ key: machineconfiguration.openshift.io/role,
+ operator: In,
+ values: [worker,mcp-1]
+ }
+ nodeSelector:
+ matchLabels:
+ node-role.kubernetes.io/mcp-1: ""
+Apply or create the MCPs with:
+# oc apply -f mcps.yaml
+machineconfigpool.machineconfiguration.openshift.io/mcp-2 created
+Now you can run “oc get mcp” again and your new MCPs will show. It will take a few minutes for your nodes to move into the new MCPs that they are assigned to. However, the nodes will NOT reboot during this time.
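One simple way to follow that progress is to watch the pools until the machine counts settle, for example:

oc get mcp -w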
+Note: You will still see the original worker and master MCPs that are part of the cluster.
This is what it will look like right after you apply the MCPs:
+# oc get mcp
+NAME CONFIG UPDATED UPDATING DEGRADED MACHINECOUNT READYMACHINECOUNT UPDATEDMACHINECOUNT DEGRADEDMACHINECOUNT AGE
+master rendered-master-b…e83 True False False 3 3 3 0 25d
+mcp-1 rendered-mcp-1-2…c4f False True True 1 0 0 0 10s
+mcp-2 rendered-mcp-2-2…c4f False True True 1 0 0 0 10s
+worker rendered-worker-2…c4f False True True 0 0 0 2 25d
+This is what it will look like after the MCPs have been completely applied:
+# oc get mcp
+NAME CONFIG UPDATED UPDATING DEGRADED MACHINECOUNT READYMACHINECOUNT UPDATEDMACHINECOUNT DEGRADEDMACHINECOUNT AGE
+master rendered-master-b…e83 True False False 3 3 3 0 25d
+mcp-1 rendered-mcp-1-2…c4f True False False 1 1 1 0 7m33s
+mcp-2 rendered-mcp-2-2…c4f True False False 1 1 1 0 51s
+worker rendered-worker-2…c4f True False False 0 0 0 0 25d
+
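Once the pools report UPDATED as True, you can also confirm which nodes ended up in each pool by selecting on the labels that were added earlier (mcp-1 and mcp-2 in this example):

oc get nodes -l node-role.kubernetes.io/mcp-1=
oc get nodes -l node-role.kubernetes.io/mcp-2=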