Adds a source container that freezes the state of source repos #283

Draft: wants to merge 1 commit into main
Conversation

@terrykong (Contributor)

Adds initial support for a "source" container that can either be built from source at refs given via --build-arg, or have individual repos overridden via extra --build-context entries.

Examples:

# All repos at `main`
docker buildx build -f Dockerfile.src --tag toolbox-src .

# Build flax at tag v0.7.4
docker buildx build -f Dockerfile.src --tag toolbox-src --build-arg REF_FLAX=v0.7.4  .

# Build with a flax you've checked out yourself (e.g. because you don't want to mount an SSH key into the build)
git clone https://github.com/google/flax.git /tmp/flax
docker buildx build -f Dockerfile.src --tag toolbox-src --build-context flax-source=/tmp/flax .

@terrykong requested a review from yhtang on October 4, 2023 05:31
@terrykong changed the title from "Adds a source container that freezes dependencies" to "Adds a source container that freezes the state of source repos" on Oct 4, 2023
@yhtang (Collaborator) left a comment

Speedy!

The source container is only meant to be used in CI for the following purposes:

  • protocol normalization: we want to support protocols like https, ssh, local file, etc., for fetching the various source repositories. Instead of letting the individual FW/model Dockerfiles handle this logic, we can do it all at once when building the source container. Once a repo is cloned into the source container, we mount it in subsequent build processes so that, from the perspective of the FW/model Dockerfiles, they are always checking out a local repo.
  • code archival: the source container will be preserved in our container registry for reproducible builds, e.g. when a workflow needs to be rerun.
  • achieving the above two goals while ensuring the FW/model Dockerfiles remain usable out-of-the-box in standalone mode.

Given the above requirements, I feel that we can make a minimal Dockerfile.src that looks like

FROM scratch
ARG SRC_PATH=/src
# copy everything under SRC_PATH in the build context into /src in the image
COPY ${SRC_PATH}/ /src/

while the logic for handling different source protocols can be done in CI via a bash function in .github/workflows/scripts/clone-repo.sh:

# Minimal sketch of an implementation: split REPO#REF, clone, then check out the ref
function clone-repo() {
  # $1: SRC, e.g. https://github.com/google/jax.git#jax-v0.4.16
  # $2: DST, e.g. /src/jax-source
  local url="${1%%#*}" ref="${1#*#}" dst="$2"
  git clone "$url" "$dst"
  git -C "$dst" checkout "$ref"
}

and then in _build_src.yaml, we can do something like:

...
jobs:
  build:
    steps:
      - name: check out source repos locally
        run: |
          for package in jax xla pax ...; do
            # variable indirection: look up SRC_JAX, SRC_XLA, ... for this package
            ref_var="SRC_${package^^}"
            clone-repo "${!ref_var}" "/src/${package}-source"
          done

      - name: build source container
        # use build-push-action with one build-arg SRC_PATH=/src
...

@terrykong (Contributor, Author) commented Oct 4, 2023

I can certainly move the main cloning logic out of the Dockerfile into a script, but that puts more logic outside the Dockerfile, which I'd like to keep hermetic. One benefit of the current structure is that BuildKit will parallelize the independent stages :)

The way I initially proposed the structure, the Dockerfile needs no extra inputs to build the source container by default, but it can be customized via build-args or build-contexts; whereas I think you're proposing cloning outside the Dockerfile and copying everything in from the single default build context.

Perhaps I'm not seeing something you're seeing. Could you share more of your thought process?

@yhtang (Collaborator) commented Oct 4, 2023

Honestly, I don't have a fully formed idea here, as this is new ground for us :)

Some points (not necessarily fully connected yet):

  1. The source container is more or less a fancy tarball. It just happens that we can reuse the docker registry infrastructure for storing this tarball. Otherwise, it can be difficult to store many large tarballs over time.
  2. The existing Dockerfile would still need extra logic for building the source container if, for example, we were to permit fancy syntax such as repo#PR (we currently support repo#tag, repo#branch, repo#SHA); a sketch of the PR case follows this list.
  3. We want the build process to be as simple as possible in terms of logic and code length. Putting the logic for processing the repo#ref in a script/function can help us avoid repetition and handle all repos with a for loop. It is unclear whether the script should run inside or outside the Dockerfile though.
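
For the repo#PR case in point 2, a rough sketch of what that extra handling might look like (the repo URL, ref, and destination below are just placeholders):

# Hypothetical sketch: REF may be a tag, branch, SHA, or a GitHub PR ref like pull/12345
src="https://github.com/google/jax.git#pull/12345"
url="${src%%#*}"
ref="${src#*#}"
git clone "$url" jax-source
if [[ "$ref" == pull/* ]]; then
  # GitHub exposes PR heads at refs/pull/<N>/head, which a plain checkout won't find
  git -C jax-source fetch origin "refs/${ref}/head:pr-head"
  git -C jax-source checkout pr-head
else
  git -C jax-source checkout "$ref"
fi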

@yhtang (Collaborator) commented Oct 4, 2023

Also, what are your thoughts on how we're going to use the source container?

I can think of the following options:

docker export

This is where we literally treat the source container as a tarball: we create a container from the image, run docker export <container> | tar -xf - to unpack it, and then use the contents as the source repos for the FW/model Dockerfiles. The advantage of this approach is that no changes at all are needed to the existing FW/model Dockerfiles, which ensures they remain self-contained.
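
A minimal sketch of that flow (the image name and destination directory are placeholders); docker export operates on a container, so one is created from the image first:

cid=$(docker create source-image:tag)        # docker export works on containers, so create one first
mkdir -p ./src
docker export "$cid" | tar -xf - -C ./src    # unpack the container filesystem into ./src
docker rm "$cid"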

buildx build --build-context

Buildx allows us to specify images as additional contexts, which appear as pre-existing stages in a multi-stage build:

$ docker buildx build --build-context source-container=docker-image://REPO@sha256:0123456789 

The FW/model Dockerfile would then need to be modified along the lines of

COPY --from=source-container ...
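
For example (the Dockerfile and tag names here are placeholders), building a FW image against the frozen sources might look like:

# Hypothetical: build a FW image, pulling sources from the frozen source container
docker buildx build -f Dockerfile.pax \
  --build-context source-container=docker-image://REPO@sha256:0123456789 \
  --tag pax-toolbox .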

@terrykong (Contributor, Author)

can help us avoid repetition

That's a good point. One crummy thing is that Dockerfiles aren't very flexible, so moderately complex functionality (like variadic inputs) tends to be quite verbose. Sure, I can move it out to a script.

One thing I'm still struggling with is the level of complexity this introduces to our build process, and whether we can keep it simple for those who want to build manually with our Dockerfiles. I fear we may be creating something that can only be reproduced by inspecting our CI and chaining the logic together by hand.

I’m also recalling that one property we want from this is the ability to pin all sources, but also to be able to tweak something small if we find an issue. So let me step back and start from what I think we need in order to pin the state of the ecosystem. I’ll assume we need something like a manifest.txt, which might look like this:

# repo    ref     optional[url|path]                             clone    
jax       main    https://github.com/google/jax.git              True
t5x       main    https://github.com/google-research/t5x.git     True
paxml     main    ./paxml-local                                  True
seqio     main    https://github.com/google/seqio.git            False

This file will be used as the input to our entire build and should be present inside the container.

Then we can have some script, say ./pin.sh manifest.txt >manifest-pinned.txt, that replaces the ref column with a SHA (e.g. via git ls-remote). This manifest-pinned.txt can also be used as a build input, since ./pin.sh should be idempotent. That satisfies our "tweaking" requirement: we can use a fixed manifest-pinned.txt and change one SHA if needed.
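
A rough sketch of what pin.sh could look like (the column handling and output formatting are assumptions based on the manifest above):

#!/usr/bin/env bash
# Hypothetical pin.sh: rewrite the ref column as the SHA it currently resolves to
set -euo pipefail

while read -r repo ref src clone; do
  # pass header comments and blank lines through untouched
  if [[ -z "$repo" || "$repo" == \#* ]]; then echo "$repo $ref $src $clone"; continue; fi
  if [[ -d "$src" ]]; then
    sha=$(git -C "$src" rev-parse "$ref")                       # local path, e.g. ./paxml-local
  else
    sha=$(git ls-remote "$src" "$ref" | awk 'NR==1 {print $1}')
  fi
  # if ref is already a SHA, ls-remote returns nothing; keep it as-is so pinning stays idempotent
  printf '%-9s %-42s %-46s %s\n' "$repo" "${sha:-$ref}" "$src" "$clone"
done < "$1"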

Then it’s just a matter of cloning the repos to the path specified under the repo column.

I also include seqio as an example here, since there are some Python dependencies that we may wish to pin. We can list them in the manifest with the clone column set to False so that no cloning happens; but when it comes time to pip-compile, we can look up the SHA in this table and pin the package to the appropriate commit.

We can also allow entries like ./paxml-local when there is a local submodule or checkout we want to copy in, for example when someone wants to use their private fork.
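
Tying that together, a rough sketch of the clone step driven by manifest-pinned.txt (the /src/<repo>-source destination convention and True/False handling are assumptions):

# Hypothetical clone step: materialize every repo whose clone column is True
while read -r repo sha src clone; do
  [[ -z "$repo" || "$repo" == \#* ]] && continue
  [[ "$clone" != True ]] && continue            # e.g. seqio is only pinned for pip-compile
  dst="/src/${repo}-source"                     # follows the /src/<repo>-source convention above
  if [[ -d "$src" ]]; then
    cp -r "$src" "$dst"                         # local checkout or submodule, e.g. ./paxml-local
  else
    git clone "$src" "$dst"
  fi
  git -C "$dst" checkout "$sha"
done < manifest-pinned.txt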

———

I’m actually now thinking the source container should:

  1. Be a stage of Dockerfile.base (to reduce the number of Dockerfiles; a possible invocation is sketched after this list)
  2. Take in this manifest.txt
  3. Pin it inside the container
  4. And finally clone the repos
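
A possible invocation of that combined stage (the stage name, file names, and build-arg are all placeholders):

# Hypothetical: build only the source stage of Dockerfile.base, feeding it the manifest
docker buildx build -f Dockerfile.base --target source \
  --build-arg MANIFEST=manifest.txt --tag toolbox-src .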

What are your thoughts on this?

@terrykong marked this pull request as draft on October 10, 2023 17:00