Adds a source container that freezes the state of source repos #283

Draft: wants to merge 1 commit into main
Conversation

@terrykong (Contributor)

Adds initial support for a "source" container that can either be built from source at refs given via --build-arg, or have individual repos overridden via extra --build-context entries.

Examples:

# All repos at `main`
docker buildx build -f Dockerfile.src --tag toolbox-src .

# Build flax at tag v0.7.4
docker buildx build -f Dockerfile.src --tag toolbox-src --build-arg REF_FLAX=v0.7.4  .

# Build with a flax you've checked out yourself (e.g. because you don't want to mount an SSH key into the build)
git clone https://github.com/google/flax.git /tmp/flax
docker buildx build -f Dockerfile.src --tag toolbox-src --build-context flax-source=/tmp/flax .

@terrykong requested a review from yhtang on October 4, 2023 05:31
@terrykong changed the title from "Adds a source container that freezes dependencies" to "Adds a source container that freezes the state of source repos" on Oct 4, 2023
@yhtang (Collaborator) left a comment

Speedy!

The source container is only meant to be used in CI for the following purposes:

  • protocol normalization: we want to support protocols like https, ssh, local file, etc., for fetching the various source repositories. Instead of letting the individual FW/model Dockerfiles handle this logic, we can do it all at once when building the source container. Once a repo is cloned into the source container, we mount it in subsequent build processes so that, from the perspective of the FW/model Dockerfiles, they are always checking out a local repo.
  • code archival: the source container will be preserved in our container registry for reproducible builds, e.g. when a workflow needs to be rerun.
  • achieving the above two goals while ensuring the FW/model Dockerfiles remain usable out-of-the-box in standalone mode.

Given the above requirements, I feel that we can make a minimal Dockerfile.src that looks like

FROM scratch
ARG SRC_PATH=/src
# copy everything under SRC_PATH in the build context into /src in the image
COPY ${SRC_PATH}/ /src/

while the logic for handling different source protocols can be done in CI via a bash function in .github/workflows/scripts/clone-repo.sh:

# Minimal sketch of an implementation: split REPO#REF, clone, then check out the ref
function clone-repo() {
  # $1: SRC, e.g. https://github.com/google/jax.git#jax-v0.4.16
  # $2: DST, e.g. /src/jax-source
  local url="${1%%#*}" ref="${1#*#}" dst="$2"
  git clone "$url" "$dst"
  git -C "$dst" checkout "$ref"
}

and then in _build_src.yaml, we can do something like:

...
jobs:
  build:
    steps:
      - name: check out source repos locally
        run: |
          for package in jax xla pax ...; do
            # variable indirection: look up SRC_JAX, SRC_XLA, ... for this package
            ref_var="SRC_${package^^}"
            clone-repo "${!ref_var}" "/src/${package}-source"
          done

      - name: build source container
        # use build-push-action with one build-arg SRC_PATH=/src
...

@terrykong (Contributor, Author) commented Oct 4, 2023

I can certainly move the main cloning logic out of the Dockerfile into a script, but that puts more logic outside the Dockerfile, which I'd like to keep hermetic. One benefit of the current structure is that BuildKit will parallelize the independent stages :)

The way I initially proposed the structure, the Dockerfile needs no extra inputs to build the source container by default, but it can be customized via build-args or build-contexts; whereas I think you're proposing cloning outside the Dockerfile and copying everything in from the single default build context.

Perhaps I'm not seeing something you're seeing. Could you share more of your thought process?

@yhtang (Collaborator) commented Oct 4, 2023

Honestly, I don't have a fully formed idea here, as this is new ground for us :)

Some points (not necessarily fully connected yet):

  1. The source container is more or less a fancy tarball. It just happens that we can reuse the docker registry infrastructure for storing this tarball. Otherwise, it can be difficult to store many large tarballs over time.
  2. The existing Dockerfile would still need extra logic for building the source container if, for example, we were to permit fancy syntax such as repo#PR (we currently support repo#tag, repo#branch, repo#SHA); a sketch of the PR case follows this list.
  3. We want the build process to be as simple as possible in terms of logic and code length. Putting the logic for processing the repo#ref in a script/function can help us avoid repetition and handle all repos with a for loop. It is unclear whether the script should run inside or outside the Dockerfile though.
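
For the repo#PR case in point 2, a rough sketch of what that extra handling might look like (the repo URL, ref, and destination below are just placeholders):

# Hypothetical sketch: REF may be a tag, branch, SHA, or a GitHub PR ref like pull/12345
src="https://github.com/google/jax.git#pull/12345"
url="${src%%#*}"
ref="${src#*#}"
git clone "$url" jax-source
if [[ "$ref" == pull/* ]]; then
  # GitHub exposes PR heads at refs/pull/<N>/head, which a plain checkout won't find
  git -C jax-source fetch origin "refs/${ref}/head:pr-head"
  git -C jax-source checkout pr-head
else
  git -C jax-source checkout "$ref"
fi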

@yhtang (Collaborator) commented Oct 4, 2023

Also, what are your thoughts on how we're going to use the source container?

I can think of the following options:

docker export

This is where we literally treat the source container as a tarball: we create a container from the image, run docker export <container> | tar -xf - to unpack it, and then use the contents as the source repos for the FW/model Dockerfiles. The advantage of this approach is that no changes at all are needed to the existing FW/model Dockerfiles, which ensures they remain self-contained.
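
A minimal sketch of that flow (the image name and destination directory are placeholders); docker export operates on a container, so one is created from the image first:

cid=$(docker create source-image:tag)        # docker export works on containers, so create one first
mkdir -p ./src
docker export "$cid" | tar -xf - -C ./src    # unpack the container filesystem into ./src
docker rm "$cid"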

buildx build --build-context

Buildx allows us to specify images as additional contexts, which appear as pre-existing stages in a multi-stage build:

$ docker buildx build --build-context source-container=docker-image://REPO@sha256:0123456789 

The FW/model Dockerfile would then need to be modified along the lines of

COPY --from=source-container ...
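
For example (the Dockerfile and tag names here are placeholders), building a FW image against the frozen sources might look like:

# Hypothetical: build a FW image, pulling sources from the frozen source container
docker buildx build -f Dockerfile.pax \
  --build-context source-container=docker-image://REPO@sha256:0123456789 \
  --tag pax-toolbox .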

@terrykong (Contributor, Author)

can help us avoid repetition

That's a good point. One crummy thing is that Dockerfiles aren't very flexible, so moderately complex functionality (like variadic inputs) tends to be quite verbose. Sure, I can move it out to a script.

One thing I'm still struggling with is the level of complexity this introduces to our build process, and whether we can keep it simple for those who want to build manually with our Dockerfiles. I fear we may be creating something that can only be reproduced by inspecting our CI and chaining the logic together by hand.

I’m also recalling that one property we want from this is the ability to pin all sources, but also to be able to tweak something small if we find an issue. So let me step back and start from what I think we need in order to pin the state of the ecosystem. I’ll assume we need something like a manifest.txt, which might look like this:

# repo    ref     optional[url|path]                             clone    
jax       main    https://github.com/google/jax.git              True
t5x       main    https://github.com/google-research/t5x.git     True
paxml     main    ./paxml-local                                  True
seqio     main    https://github.com/google/seqio.git            False

This file will be used as the input to our entire build and should be present inside the container.

Then we can have some script, say ./pin.sh manifest.txt >manifest-pinned.txt, that replaces the ref column with a SHA (e.g. via git ls-remote). This manifest-pinned.txt can also be used as a build input, since ./pin.sh should be idempotent. That satisfies our "tweaking" requirement: we can use a fixed manifest-pinned.txt and change one SHA if needed.
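
A rough sketch of what pin.sh could look like (the column handling and output formatting are assumptions based on the manifest above):

#!/usr/bin/env bash
# Hypothetical pin.sh: rewrite the ref column as the SHA it currently resolves to
set -euo pipefail

while read -r repo ref src clone; do
  # pass header comments and blank lines through untouched
  if [[ -z "$repo" || "$repo" == \#* ]]; then echo "$repo $ref $src $clone"; continue; fi
  if [[ -d "$src" ]]; then
    sha=$(git -C "$src" rev-parse "$ref")                       # local path, e.g. ./paxml-local
  else
    sha=$(git ls-remote "$src" "$ref" | awk 'NR==1 {print $1}')
  fi
  # if ref is already a SHA, ls-remote returns nothing; keep it as-is so pinning stays idempotent
  printf '%-9s %-42s %-46s %s\n' "$repo" "${sha:-$ref}" "$src" "$clone"
done < "$1"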

Then it’s just a matter of cloning the repos to the path specified under the repo column.

I also include seqio as an example here, since there are some Python dependencies that we may wish to pin. We can list them in the manifest with the clone column set to False so that no cloning happens; but when it comes time to pip-compile, we can look up the SHA in this table and pin the package to the appropriate commit.

We can also allow entries like ./paxml-local when there is a local submodule or checkout we want to copy in, for example when someone wants to use their private fork.
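
Tying that together, a rough sketch of the clone step driven by manifest-pinned.txt (the /src/<repo>-source destination convention and True/False handling are assumptions):

# Hypothetical clone step: materialize every repo whose clone column is True
while read -r repo sha src clone; do
  [[ -z "$repo" || "$repo" == \#* ]] && continue
  [[ "$clone" != True ]] && continue            # e.g. seqio is only pinned for pip-compile
  dst="/src/${repo}-source"                     # follows the /src/<repo>-source convention above
  if [[ -d "$src" ]]; then
    cp -r "$src" "$dst"                         # local checkout or submodule, e.g. ./paxml-local
  else
    git clone "$src" "$dst"
  fi
  git -C "$dst" checkout "$sha"
done < manifest-pinned.txt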

———

I’m actually now thinking the source container should:

  1. Be a stage of Dockerfile.base (to reduce the number of Dockerfiles; a possible invocation is sketched after this list)
  2. Take in this manifest.txt
  3. Pin it inside the container
  4. And finally clone the repos
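
A possible invocation of that combined stage (the stage name, file names, and build-arg are all placeholders):

# Hypothetical: build only the source stage of Dockerfile.base, feeding it the manifest
docker buildx build -f Dockerfile.base --target source \
  --build-arg MANIFEST=manifest.txt --tag toolbox-src .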

What are your thoughts on this?

@terrykong marked this pull request as draft on October 10, 2023 17:00