Upgrade 4.15.0-0.okd-2024-03-10-010116 to the 4.16.0-okd-scos.1 machine-config-daemon issues #2078
-
Hello, I'm trying to upgrade my cluster from 4.15.0-0.okd-2024-03-10-010116 to 4.16.0-okd-scos.1. The cluster has 2 masters and 3 workers on Fedora CoreOS 39.20240210.3.0, all running on Proxmox without Secure Boot. I followed the documentation that explains how to upgrade: https://okd.io/docs/project/upgrade-notes/from-4-15/force-upgrade-to-stable-4-16

I modified the kube-apiserver-operator deploy config as described in the documentation and the update began. Practically all components were updated (except machine-config) and the first master rebooted into CentOS Stream CoreOS 416.9.202411211032-0.

I have some issues with the 2 machine-config-daemon pods that run on the first master:

```
I0101 15:34:37.800798  123225 start.go:68] Version: machine-config-daemon-4.6.0-202006240615.p0-2860-g4bb33649-dirty (4bb3364914c4dbcdfcc08b0914f402cdd38f014f)
```

The second pod, machine-config-daemon-lnshc, is in CrashLoopBackOff:

```
I0101 15:37:58.741413  121218 update.go:2641] Disk currentConfig "rendered-worker-b3a57dcbf341fcf2ff062281d8f0c1dd" overrides node's currentConfig annotation "rendered-worker-84ea878f8910625351bfcf5b66a72542"
```

It tries to download a specific image but gets another one. I tried to modify the pod configuration to point it at the right image, but I couldn't find the original sha (sha256:eb85d903c52970e2d6823d92c880b20609d8e8e0dbc5ad27e16681ff444c8c83); I don't know where it is set.

I connected to the first node, which shows 2 errors. I tried to start the first service: it waits for something and then fails:

```
Jan 01 15:46:00 okd4-control-plane-1.okd.ia5-f1.net kubenswrapper[2502]: I0101 15:46:00.923494    2502 scope.go:117] "RemoveContainer" containerID="f5aeb01967dd1addda481439d9ec19b82cdc7066b0658ce4086ff689df9a9e5d"
```

The second service tries to create a group, sees that the group already exists, and fails. I read some topics on GitHub that suggested using rpm-ostree to rebase onto the scos-content image:
```
rpm-ostree rebase --experimental quay.io/okd/scos-content:tag--stream-coreos
```

It looks to me like the machine-config-daemon is trying to pull the wrong image. I tried to change it to the correct sha, but I don't know where it is set. Does anyone have an idea what the problem could be?
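For anyone hunting for where that image reference is set: in OKD/OpenShift the OS image the machine-config-daemon applies comes from the `osImageURL` field of the rendered MachineConfig. The sketch below illustrates the field on a minimal sample manifest whose names and values are hypothetical; on a live cluster you would query the real object with `oc` instead.

```shell
# Sketch: the OS image the MCD applies is spec.osImageURL in the rendered
# MachineConfig. On a live cluster you would run something like:
#   oc get machineconfig rendered-worker-<hash> -o jsonpath='{.spec.osImageURL}'
# Here we just illustrate the field on a hypothetical sample manifest:
cat > /tmp/sample-mc.yaml <<'EOF'
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: rendered-worker-example
spec:
  osImageURL: quay.io/okd/scos-content@sha256:eb85d903c52970e2d6823d92c880b20609d8e8e0dbc5ad27e16681ff444c8c83
EOF
grep 'osImageURL:' /tmp/sample-mc.yaml
```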
Replies: 3 comments 2 replies
-
Hello, I'm continuing to look for a solution ;) I checked the filesystems: everything is on xfs except /boot, which is on ext4. I checked the shas in the pod configurations in openshift-machine-config-operator; all of them are correct. I tried to delete the worker and master MachineConfigPools without success. I'm out of ideas and don't understand what could cause the error; rpm-ostree is blocked because of a dependency… When I look at the machine-config-daemon logs, it tries to access something, cannot, and then crashes:
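In case it helps anyone reproducing this check, the filesystem types can be confirmed directly on the node. A minimal sketch using `findmnt` from util-linux (mount points as on a typical FCOS/SCOS node):

```shell
# Sketch: confirm which filesystem backs each mount point on the node.
# findmnt exits non-zero and prints nothing when the path is not mounted,
# which is itself a useful signal when /boot is the suspect.
findmnt -n -o FSTYPE /                      # xfs on this cluster
findmnt -n -o FSTYPE /boot || echo "/boot is not mounted"
```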
-
In my case, on one of the nodes, /boot was not mounted. Check whether your /boot is mounted; if not, try the following:
-
The issue occurred because Fedora CoreOS version 40 or later was used for the installation. That release ships e2fsprogs 1.47 or later, which enables the `orphan_file` ext4 feature by default. One of the significant changes introduced with the upgrade starting from 4.15 is the transition from a Fedora-based to a CentOS Stream-based OS; as a result, the e2fsck version is effectively downgraded to 1.46, which does not support the `orphan_file` feature.
A prerequisite for mounting /boot is passing the filesystem check, as you can see in the generated mount unit.

Output of the file /run/systemd/generator/boot.mount:
However, during testing, the non-compatible e2fsck version resulted in an error:

System is unable to proceed with the update u…
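To make the feature mismatch concrete: whether `orphan_file` is enabled can be checked with `dumpe2fs`, and (with e2fsprogs 1.47+) cleared with `tune2fs` so the older e2fsck in SCOS can still pass the filesystem check. A sketch on a throwaway image file, assuming e2fsprogs is installed; on the real node the target would be the /boot block device instead:

```shell
# Sketch: build a small ext4 image and inspect its feature flags.
# On the node, point dumpe2fs/tune2fs at the /boot device instead.
truncate -s 64M /tmp/bootfs.img
mkfs.ext4 -q /tmp/bootfs.img
dumpe2fs -h /tmp/bootfs.img 2>/dev/null | grep 'Filesystem features'
# With e2fsprogs >= 1.47 the feature list includes orphan_file; clear it
# so that e2fsck 1.46 can still check the filesystem:
if dumpe2fs -h /tmp/bootfs.img 2>/dev/null | grep -q orphan_file; then
    tune2fs -O ^orphan_file /tmp/bootfs.img
fi
```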