Upgrade 4.15.0-0.okd-2024-03-10-010116 to the 4.16.0-okd-scos.1 machine-config-daemon issues #2078
-
Hello, I'm trying to upgrade my cluster from 4.15.0-0.okd-2024-03-10-010116 to 4.16.0-okd-scos.1. The cluster has 2 masters and 3 workers on Fedora CoreOS 39.20240210.3.0, all running on Proxmox without Secure Boot. I followed the documentation that explains how to upgrade: https://okd.io/docs/project/upgrade-notes/from-4-15/force-upgrade-to-stable-4-16

I modified the kube-apiserver-operator deploy config as described in the documentation and the update began. Practically all components were updated (except machine-config) and the first master rebooted into CentOS Stream CoreOS 416.9.202411211032-0.

I have some issues with the 2 machine-config-daemon pods that run on the first master:

```
I0101 15:34:37.800798  123225 start.go:68] Version: machine-config-daemon-4.6.0-202006240615.p0-2860-g4bb33649-dirty (4bb3364914c4dbcdfcc08b0914f402cdd38f014f)
```

The second pod, machine-config-daemon-lnshc, is in CrashLoopBackOff:

```
I0101 15:37:58.741413  121218 update.go:2641] Disk currentConfig "rendered-worker-b3a57dcbf341fcf2ff062281d8f0c1dd" overrides node's currentConfig annotation "rendered-worker-84ea878f8910625351bfcf5b66a72542"
```

It tries to download a specific image but gets another one. I tried to modify the pod configuration to point it at the right image, but I couldn't find the original sha (sha256:eb85d903c52970e2d6823d92c880b20609d8e8e0dbc5ad27e16681ff444c8c83); I don't know where it is set.

I connected to the first node, which shows 2 errors. I tried to start the first service: it waits for something and then fails:

```
Jan 01 15:46:00 okd4-control-plane-1.okd.ia5-f1.net kubenswrapper[2502]: I0101 15:46:00.923494    2502 scope.go:117] "RemoveContainer" containerID="f5aeb01967dd1addda481439d9ec19b82cdc7066b0658ce4086ff689df9a9e5d"
```

The second service tries to create a group, sees that the group already exists, and fails. I read some topics on GitHub that suggested using rpm-ostree to rebase onto the scos-content image:
```
rpm-ostree rebase --experimental quay.io/okd/scos-content:tag--stream-coreos
```

It looks to me like the machine-config-daemon is trying to pull the wrong image. I tried to change it to the correct sha, but I don't know where it is set. Does anyone have an idea what the problem could be?
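For anyone hunting for where that image reference is set: in OKD/OpenShift the OS image the machine-config-daemon applies comes from the `osImageURL` field of the rendered MachineConfig. The sketch below illustrates the field on a minimal sample manifest whose names and values are hypothetical; on a live cluster you would query the real object with `oc` instead.

```shell
# Sketch: the OS image the MCD applies is spec.osImageURL in the rendered
# MachineConfig. On a live cluster you would run something like:
#   oc get machineconfig rendered-worker-<hash> -o jsonpath='{.spec.osImageURL}'
# Here we just illustrate the field on a hypothetical sample manifest:
cat > /tmp/sample-mc.yaml <<'EOF'
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: rendered-worker-example
spec:
  osImageURL: quay.io/okd/scos-content@sha256:eb85d903c52970e2d6823d92c880b20609d8e8e0dbc5ad27e16681ff444c8c83
EOF
grep 'osImageURL:' /tmp/sample-mc.yaml
```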
Replies: 3 comments 2 replies
-
Hello, I'm continuing to look for a solution ;) I checked the filesystems: everything is on xfs except /boot, which is on ext4. I checked the shas in the pod configurations in openshift-machine-config-operator; all of them are correct. I tried to delete the worker and master MachineConfigPools without success. I'm out of ideas and don't understand what could cause the error; rpm-ostree is blocked because of a dependency… When I look at the machine-config-daemon logs, it tries to access something, cannot, and then crashes:
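In case it helps anyone reproducing this check, the filesystem types can be confirmed directly on the node. A minimal sketch using `findmnt` from util-linux (mount points as on a typical FCOS/SCOS node):

```shell
# Sketch: confirm which filesystem backs each mount point on the node.
# findmnt exits non-zero and prints nothing when the path is not mounted,
# which is itself a useful signal when /boot is the suspect.
findmnt -n -o FSTYPE /                      # xfs on this cluster
findmnt -n -o FSTYPE /boot || echo "/boot is not mounted"
```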
-
In my case, on one of the nodes, /boot was not mounted. Check whether your /boot is mounted; if not, try the following:
-
The issue occurred because Fedora CoreOS version 40 or later was used for the installation. That release ships e2fsprogs 1.47 or later, which enables the `orphan_file` ext4 feature by default. One of the significant changes introduced with the upgrade starting from 4.15 is the transition from a Fedora-based to a CentOS Stream-based OS; as a result, the e2fsck version is effectively downgraded to 1.46, which does not support the `orphan_file` feature.
A prerequisite for mounting /boot is passing the filesystem check, as you can see in the generated mount unit.

Output of the file /run/systemd/generator/boot.mount:
However, during testing, the non-compatible e2fsck version resulted in an error:

System is unable to proceed with the update u…
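To make the feature mismatch concrete: whether `orphan_file` is enabled can be checked with `dumpe2fs`, and (with e2fsprogs 1.47+) cleared with `tune2fs` so the older e2fsck in SCOS can still pass the filesystem check. A sketch on a throwaway image file, assuming e2fsprogs is installed; on the real node the target would be the /boot block device instead:

```shell
# Sketch: build a small ext4 image and inspect its feature flags.
# On the node, point dumpe2fs/tune2fs at the /boot device instead.
truncate -s 64M /tmp/bootfs.img
mkfs.ext4 -q /tmp/bootfs.img
dumpe2fs -h /tmp/bootfs.img 2>/dev/null | grep 'Filesystem features'
# With e2fsprogs >= 1.47 the feature list includes orphan_file; clear it
# so that e2fsck 1.46 can still check the filesystem:
if dumpe2fs -h /tmp/bootfs.img 2>/dev/null | grep -q orphan_file; then
    tune2fs -O ^orphan_file /tmp/bootfs.img
fi
```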