Container creation sometimes fails in the 32-bit ARM test job #1780

Open
EliahKagan opened this issue Jan 18, 2025 · 2 comments

Comments

@EliahKagan
Member

EliahKagan commented Jan 18, 2025

Current behavior 😯

One of the changes in #1777 was to add an arm32v7 test job that runs in a container on the new arm64 runner (cbe3793, fbc27b5), analogous to the preexisting i386 test job that runs in a container on an amd64 runner. It looks like this may be brittle, with container creation failing from time to time. This is the failure noted in #1778 (comment).

/usr/bin/docker start 4224fb6a96d4ae28ceca367700326843715626ffe3eb995cdbe03b0aa4e0b4b2
  Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: unable to apply cgroup configuration: unable to start unit "docker-4224fb6a96d4ae28ceca367700326843715626ffe3eb995cdbe03b0aa4e0b4b2.scope" (properties [{Name:Description Value:"libcontainer container 4224fb6a96d4ae28ceca367700326843715626ffe3eb995cdbe03b0aa4e0b4b2"} {Name:Slice Value:"system.slice"} {Name:Delegate Value:true} {Name:PIDs Value:@au [4198]} {Name:MemoryAccounting Value:true} {Name:CPUAccounting Value:true} {Name:IOAccounting Value:true} {Name:TasksAccounting Value:true} {Name:DefaultDependencies Value:false}]): Failed to activate service 'org.freedesktop.systemd1': timed out (service_start_timeout=25000ms): unknown
  Error: failed to start containers: 4224fb6a96d4ae28ceca367700326843715626ffe3eb995cdbe03b0aa4e0b4b2
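To tell whether a later failed run is the same failure, one option is to search the job log for the systemd activation timeout that appears in the error above. The following is only a minimal sketch: the function name `systemd_activation_timeout_ms` is hypothetical, and it assumes the log text has already been obtained as a string by some other means.

```python
import re

# Signature taken from the error text quoted above: the D-Bus activation of
# systemd timed out while runc was applying the cgroup configuration.
SYSTEMD_TIMEOUT = re.compile(
    r"Failed to activate service 'org\.freedesktop\.systemd1': "
    r"timed out \(service_start_timeout=(\d+)ms\)"
)


def systemd_activation_timeout_ms(log_text):
    """Return the reported timeout in milliseconds if the log contains
    this failure signature, otherwise None. (Hypothetical helper.)"""
    match = SYSTEMD_TIMEOUT.search(log_text)
    return int(match.group(1)) if match else None
```

A result of `None` would suggest that a container startup failure has some other cause and should be looked at separately.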

This resembles, and is probably identical to, some failures I had seen when testing #1777, which I had erroneously assumed (or hoped) were due to a hiccup in infrastructure rather than a persistent problem. This issue tracks that in case it is a persistent problem, which seems likely. If it happens again and no fix is apparent, I can revert the parts of #1777 that are about 32-bit testing, while keeping 87387c2 from it, which does not seem to have had any problems.

Something that probably isn't the cause

It is possible for a 64-bit ARM processor not to be capable of natively executing 32-bit ARM instructions; unlike with 64-bit and 32-bit x86, this capability is not universal. When that happens, if binfmt_misc is configured to provide emulation via QEMU, a container of the incompatible architecture can still be run, but it will run much slower and some things may not work. However, while that was an early concern I had, as far as I can tell from the error, that does not seem to be a factor here. Furthermore, in another repository, I checked in a reverse shell that no such architecture was enabled in binfmt_misc (EliahKagan/arm@496d9c1), and also even tried turning off binfmt_misc (EliahKagan/arm@efa15ff), and a 32-bit ARM binary was still able to run.
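For anyone who wants to repeat that kind of check, here is a minimal sketch of listing the enabled binfmt_misc handlers on a Linux machine. It is not the method used in the commits referenced above. It assumes binfmt_misc is mounted at the usual `/proc/sys/fs/binfmt_misc` location and that each handler file starts with an `enabled` or `disabled` line followed by lines such as `interpreter <path>`. The function names are hypothetical.

```python
from pathlib import Path

# Usual mount point of binfmt_misc on Linux (an assumption about the system).
BINFMT_DIR = Path("/proc/sys/fs/binfmt_misc")


def parse_binfmt_entry(text):
    """Parse the contents of one handler file into a small dict."""
    lines = text.splitlines()
    info = {"enabled": bool(lines) and lines[0].strip() == "enabled"}
    for line in lines[1:]:
        key, _, value = line.partition(" ")
        if key in ("interpreter", "flags:", "offset", "magic", "mask", "extension"):
            info[key.rstrip(":")] = value.strip()
    return info


def enabled_handlers(directory=BINFMT_DIR):
    """Return {name: parsed entry} for each enabled handler.

    Returns an empty dict if the directory does not exist."""
    if not directory.is_dir():
        return {}
    result = {}
    for path in directory.iterdir():
        # "register" and "status" are control files, not handlers.
        if path.name in ("register", "status"):
            continue
        try:
            entry = parse_binfmt_entry(path.read_text())
        except OSError:
            continue
        if entry["enabled"]:
            result[path.name] = entry
    return result


if __name__ == "__main__":
    for name, entry in sorted(enabled_handlers().items()):
        print(name, entry.get("interpreter", "?"))
```

If no enabled handler has a QEMU ARM interpreter and a 32-bit ARM binary still runs, that is consistent with the processor executing it natively.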

Expected behavior 🤔

The container specified in the container: key should start up at least as reliably in jobs on the ARM runner as on other runners.

Git behavior

Not directly applicable, but Git does test on various platforms. Cursory inspection of the runs-on keys in this workflow suggests Git may not be using the new ubuntu-24.04-arm or ubuntu-22.04-arm GHA runners at this time.

Steps to reproduce 🕹

I'm unsure what factors trigger this, or if it is effectively random. It seems likely that it will happen again, but I'm not certain, so I'm opening this issue rather than immediately changing the workflow.

When I was working on the PR, I think it happened most often when I had two pushes separated by a very short time. My first thought was that it might have to do with caching. That is implausible, though, at least with respect to the caching of Rust dependencies that we are doing, because the failure happens much earlier, when the GitHub Actions runner software runs Docker to set up the job, before any steps of the job have begun.

@Byron
Member

Byron commented Jan 19, 2025

Thanks for keeping track of this! My hope is that over time this issue will go away - the runners are still new and maybe the tracking software they have still has some shortcomings, growing pains if you will.

EliahKagan added a commit to EliahKagan/gitoxide that referenced this issue Jan 24, 2025
In the AArch64/ARM64 (64-bit, non-containerized) test-fast job,
this uses the `ubuntu-22.04-arm` runner instead of the
`ubuntu-24.04-arm` runner. This is to avoid the errors described
in GitoxideLabs#1790, i.e., to work around rust-lang/rust#135867.

Such problems have not been observed on the 22.04 runner, including
in tests intended to find them, and switching to it seems to be a
complete workaround for the problem. In contrast, continuing to use
the 24.04 runner, but attempting to work around the problem by
switching from the stable to the beta channel, looks like it would
greatly decrease the frequency of the errors but not eliminate
them. A problem with `actions/checkout` failing is likewise
observed on the 24.04 runner only, so using 22.04 avoids that too.

Because that seems like a complete workaround, this also reverts
50da7cb (GitoxideLabs#1792). That is to say that the ARM64 test-fast job is
again in the `test-fast` matrix. It is capable of cancelling or
being cancelled by the other `test-fast` checks. Code duplication
in the workflow is somewhat decreased. The job will again block PR
auto-merge.

Similar errors do not seem to have occurred in the `test-32bit`
job that runs an arm32v7 Docker image in `ubuntu-24.04-arm`, and it
is not clear that changing the runner image would help with GitoxideLabs#1780,
nor even if that issue is still happening. Therefore, it is not
changed there at this time.

This affects only ARM Linux runners. The x86-64 runners continue to
use `ubuntu-latest`, which is currently resolved to `ubuntu-24.04`,
and that does not need to be changed. Likewise, the `macos-latest`
runners use ARM processors (Apple Silicon) and they are fine.

Various experiments were done in a separate workflow. This commit
also removes that workflow, because it is not actively needed
anymore, and because, if kept, it would have to be modified to
avoid running hundreds of extra checks on each and every push.