Running Docker inside a sysbox container sporadically causes all filesystem operations to fail #892

Open
TAGraves opened this issue Dec 18, 2024 · 2 comments


TAGraves commented Dec 18, 2024

We are running into an issue where, very infrequently, Docker containers running inside our Sysbox container on AWS EC2 are unable to perform any operations on their filesystem.

The following reproduces the issue roughly one in ten thousand times:

docker run --runtime=sysbox-runc --hostname=syscont nestybox/alpine-docker:latest /bin/sh -c "dockerd & until docker version > /dev/null 2>&1; do sleep 1; done && docker --version && docker pull alpine:3.20 && docker run --pull never alpine:3.20 chmod 03775 /home"

When the issue is reproduced, chmod 03775 /home fails with chmod: /home: No such file or directory.

A full reproduction

Launch an Ubuntu 24.04 EC2 instance and run the following as root to set up the system to use Sysbox and Docker:

export DEBIAN_FRONTEND=noninteractive
DOCKER_VERSION="5:27.3.1-1~ubuntu.24.04~noble"
EXPECTED_DOCKER_VERSION="Docker version 27.3.1, build ce12230"
# Rollback containerd until sysbox 0.6.6: https://github.com/nestybox/sysbox/issues/879
CONTAINERD_VERSION="1.7.23-1"

apt-get update && apt-get upgrade -y && apt-get install -y tmux

apt-get update && \
	apt-get install -y ca-certificates curl && \
	install -m 0755 -d /etc/apt/keyrings && \
	curl -fsSL https://download.docker.com/linux/ubuntu/gpg -o /etc/apt/keyrings/docker.asc && \
	chmod a+r /etc/apt/keyrings/docker.asc && \
	echo \
	  "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] https://download.docker.com/linux/ubuntu \
	  $(. /etc/os-release && echo "$VERSION_CODENAME") stable" | \
  tee /etc/apt/sources.list.d/docker.list > /dev/null && \
  apt-get update

apt-get install -y \
		--allow-downgrades \
		docker-ce=$DOCKER_VERSION \
		docker-ce-cli=$DOCKER_VERSION \
		docker-ce-rootless-extras=$DOCKER_VERSION \
		containerd.io=$CONTAINERD_VERSION

usermod -aG docker ubuntu

# Install sysbox
wget https://downloads.nestybox.com/sysbox/releases/v0.6.5/sysbox-ce_0.6.5-0.linux_amd64.deb
docker rm $(docker ps -a -q) -f || true
apt-get install -y ./sysbox-ce_0.6.5-0.linux_amd64.deb
systemctl enable sysbox

reboot

Then run the following to build the image for the Sysbox container, which runs Docker inside it:

cat <<'EOF' > entry.sh
#!/bin/bash
# Start the inner Docker daemon in the background and give it time to come up.
echo "Starting docker..."
dockerd >/dev/null 2>&1 &
sleep 5
echo "Docker started"

docker pull alpine:3.20

for ((i=1; ; i++)); do
  echo "$(date) - $i"
  docker run --pull never alpine:3.20 chmod 03775 /home || { echo "FAILED"; exit 44; }
  sleep 0.4
done
EOF

cat <<'EOF' > Dockerfile
FROM ubuntu:22.04
ARG DOCKER_VERSION="5:27.3.1-1~ubuntu.22.04~jammy"
ARG EXPECTED_DOCKER_VERSION="Docker version 27.3.1, build ce12230"
# Rollback containerd until sysbox 0.6.6: https://github.com/nestybox/sysbox/issues/879
ARG CONTAINERD_VERSION="1.7.23-1"

RUN apt-get update && \
		apt-get install -y ca-certificates curl && \
		apt-get install -y fuse-overlayfs && \
		install -m 0755 -d /etc/apt/keyrings && \
		curl -fsSL https://download.docker.com/linux/ubuntu/gpg -o /etc/apt/keyrings/docker.asc && \
		chmod a+r /etc/apt/keyrings/docker.asc && \
		echo \
		  "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] https://download.docker.com/linux/ubuntu \
		  $(. /etc/os-release && echo "$VERSION_CODENAME") stable" | \
	  tee /etc/apt/sources.list.d/docker.list > /dev/null && \
	  apt-get clean

RUN apt-get update && \
		apt-get install -y \
			--allow-downgrades \
			docker-ce=$DOCKER_VERSION \
			docker-ce-cli=$DOCKER_VERSION \
			docker-ce-rootless-extras=$DOCKER_VERSION \
			containerd.io=$CONTAINERD_VERSION && \
		apt-get clean

COPY entry.sh /entry.sh
RUN chmod +x /entry.sh

ENTRYPOINT ["/bin/bash", "-c"]
CMD ["/entry.sh"]
EOF

docker build -t repro .

Then run:

docker run --runtime sysbox-runc repro

This runs a loop inside the Sysbox container. As stated above, an iteration fails roughly 1 in 10,000 times, although the failures are very sporadic.

Factors we've considered

We've reproduced this issue using various versions of Docker 26 and Docker 27, on both Ubuntu 22.04 and Ubuntu 24.04. We've reproduced it using sysbox-runc directly and using docker run --runtime sysbox-runc. We've seen it fail with both ext4 and xfs filesystems outside the Sysbox container. We haven't seen it fail when running Docker outside of Sysbox. We've seen it fail the first time the inner Docker container runs on a machine, so it isn't caused by repeatedly invoking Docker inside the container. However, we tried to reproduce it on GitHub Actions using the following job and were unable to, so it may be specific to the Linux kernel on AWS's Ubuntu AMIs:

jobs:
  repro:
    runs-on: ubuntu-latest
    steps:
      - run: |
          wget https://downloads.nestybox.com/sysbox/releases/v0.6.5/sysbox-ce_0.6.5-0.linux_amd64.deb
          docker rm $(docker ps -a -q) -f || true
          sudo apt-get install ./sysbox-ce_0.6.5-0.linux_amd64.deb
          sudo systemctl status sysbox -n20

          docker run --runtime=sysbox-runc --hostname=syscont nestybox/alpine-docker:latest /bin/sh -c "dockerd & until docker version > /dev/null 2>&1; do sleep 1; done && docker --version && docker pull alpine:3.20 && i=1; while [ \$i -le 10000 ]; do echo \$i && docker run --pull never alpine:3.20 chmod 03775 /home && sleep 0.4; i=\$((i + 1)); done"
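As a rough sanity check on how much a clean 10,000-iteration run says, the chance of seeing at least one failure in n runs can be estimated as 1 - (1 - p)^n. This is only a sketch: it assumes each run fails independently with the same probability p, and it takes the "roughly 1 in 10,000" figure above at face value. The helper name p_at_least_one is made up for this example.

```shell
#!/bin/sh
# Probability of at least one failure in n independent runs,
# given a per-run failure probability p (both are assumptions, see above).
p_at_least_one() {
  LC_ALL=C awk -v p="$1" -v n="$2" 'BEGIN { printf "%.4f\n", 1 - exp(n * log(1 - p)) }'
}

p_at_least_one 0.0001 10000   # prints 0.6321
p_at_least_one 0.0001 30000   # prints 0.9502
```

Under those assumptions, a single pass of 10,000 iterations would miss the failure more than a third of the time, so one clean GitHub Actions run does not rule the issue out there. Around 30,000 iterations would be needed for a roughly 95% chance of hitting it at least once.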

Other things we know

We've done some investigation inside the Sysbox container when this happens. It appears that the overlayfs directory Docker creates is completely broken in these cases. For example, in one occurrence involving the postgres Docker image, we ran

sudo ls /var/lib/docker/overlay2/e691c7a9a4426ded12e56cb89d599758efbecdd32154234305649f1625080519/merged/var/run

and saw a postgresql directory inside it, as expected. However, both writing a file into that postgresql directory and moving the directory failed with No such file or directory errors, whether attempted from inside or from outside the Docker container.
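The check described above (the directory is listable, but writes into it fail) can be scripted so it is easy to run against every overlay2 merged directory when the failure occurs. This is a minimal sketch rather than the exact commands we ran: the function name probe_dir and the probe file name are made up for illustration, and it should be run as root inside the Sysbox container.

```shell
#!/bin/sh
# Probe a directory for the "visible but unusable" symptom:
#   prints "ok" if it can be listed and written to,
#   "unlistable" if it cannot even be listed,
#   "listable-but-unwritable: <error>" if listing works but creating a file fails.
probe_dir() {
  dir=$1
  if ! ls -A "$dir" >/dev/null 2>&1; then
    echo "unlistable"
    return 2
  fi
  probe="$dir/.probe.$$"
  if err=$(touch "$probe" 2>&1); then
    rm -f "$probe"
    echo "ok"
    return 0
  fi
  echo "listable-but-unwritable: $err"
  return 1
}

# Example (as root, inside the Sysbox container):
#   for d in /var/lib/docker/overlay2/*/merged; do
#     printf '%s: ' "$d"; probe_dir "$d"
#   done
```

A directory that reports listable-but-unwritable with a No such file or directory message would match the behavior described above.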

@rodnymolina
Member

@TAGraves, thanks for the detailed description. Question: during problem reproduction, can you instantiate a second sysbox container? If so, is the problem seen in that second container too? At first glance, I can't think of anything obvious, so I'm just trying to narrow down the issue with these questions.

@rodnymolina
Member

Also, can you please check if there's any relevant sysbox-related log in your journalctl? Otherwise, enable debug logs for the sysbox-mgr daemon and try to reproduce again.
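For anyone following along, here is a sketch of one way to do this. It rests on assumptions that should be checked against the installed version: that the stock systemd units from the .deb are named sysbox-mgr and sysbox-fs, that the unit file lives under /lib/systemd/system, and that the daemon accepts a --log-level debug flag as described in the Sysbox troubleshooting guide. The helper name enable_debug_logging is made up for this example.

```shell
#!/bin/sh
# Append "--log-level debug" to the ExecStart line of a systemd unit file,
# unless a --log-level flag is already present (keeps the edit idempotent).
# Requires GNU sed for -i and --follow-symlinks.
enable_debug_logging() {
  unit_file=$1
  if grep -q -- '--log-level' "$unit_file"; then
    return 0
  fi
  sed -i --follow-symlinks '/^ExecStart/ s/$/ --log-level debug/' "$unit_file"
}

# Usage on the host, as root (unit path is an assumption; confirm it with
# `systemctl cat sysbox-mgr`). Stop any Sysbox containers before restarting.
#   enable_debug_logging /lib/systemd/system/sysbox-mgr.service
#   systemctl daemon-reload && systemctl restart sysbox
#   journalctl -u sysbox-mgr -u sysbox-fs --since "1 hour ago" --no-pager
```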
