-
Notifications
You must be signed in to change notification settings - Fork 23
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CEP for the OCI storage of conda packages & repodata #70
base: main
Are you sure you want to change the base?
Conversation
|
cep-oci.md
Outdated
|
||
A conda package, in an OCI registry, should ship up to 3 layers: | ||
|
||
- The package itself, as a tarball. (mandatory) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You mean the contents of the package, or the compressed artifact (tarball or not)? In .conda
the outer layer is actually a ZIP file, and the inner ones are zstd tarballs. Would be helpful to clarify.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, should leave the tarball out of the sentence. It's the package data itself.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, some artifacts in defaults
are available both as .tar.bz2 and .conda. Should we restrict one package layer for each label? I don't think we need to.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Everything would work fine if there are two layers, they should just not have the same mediaType.
cep-oci.md
Outdated
|
||
For example, a package like `xtensor-0.10.4-h431234.conda` would map to a OCI registry `conda-forge/linux-64/xtensor:0.10.4-h431234`. | ||
|
||
### Layers |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A table with the different layer mediaTypes
and the expected contents would make this section super easy to read.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Feel free to format and push to this PR! :) This is a collaborative document
What does a
Is it possible to refer to a layer directly by URL? In case of repodata:
I think this needs to be standardized too. |
|
||
We want to use OCI registries as a storage for conda packages. This CEP specifies how we lay out conda packages on an OCI registry. | ||
|
||
## Specification |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The distribution-spec contains useful definitions for these terms:
https://github.com/opencontainers/distribution-spec/blob/v1.0/spec.md#definitions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I linked it :)
cep-oci.md
Outdated
|
||
- for a .tar.bz2 package, the mediaType is `application/vnd.conda.package.v1` | ||
- for a .conda package, the mediaType is `application/vnd.conda.package.v2` | ||
- for the `info` folder as gzip the mediaType is `application/vnd.conda.info.v1.tar+gzip` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The info folder can get heavy with licenses, test files and what not. Will we allow .tar+zstd
, for example?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we currently have uploaded the whole conda-forge with the gzip
one, I would suggest to keep it like that for now. We could move to a zstd encoded one in the future.
Or we could just allow for any encoding (+gzip or +zstd)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be nice to be able to use the unmodified info-pip-23.3.1-py312haa95532_0.tar.zst out of the .conda
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed. I used the tar.gz approach because of how easy it is to open and explore it with pure Python. But the unmodified one makes sense as well. WE could specify that application/vnd.conda.info.v1.tar+zst
will also be accepted?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense.
.tar.gz just seemed odd although gzip is very ordinary it hasn't been used in conda packaging.
Co-authored-by: Matthew R. Becker <beckermr@users.noreply.github.com>
Co-authored-by: jaimergp <jaimergp@users.noreply.github.com>
Yeah, the nice thing is that the URLs map directly to the OCI registry. There just needs to be some post-processing for the tags. E.g. instead of
We ask for
And for packages
We ask for
|
One thing that also needs to go into this document is how tags are formatted. Some versions cannot be translated to tags directly because of the rules of OCI registries. The functions are here: And when a package name starts with |
Food for thought: implement subdirs with platform metadata per layer. |
For reference, rules are here: https://github.com/opencontainers/distribution-spec/blob/v1.0/spec.md#pulling-manifests |
Hmm, AFAIK we are rather using |
Yeah, but regular HTTPs requests don't work. That's why we need a middleware or some other layer that converts |
Co-authored-by: jaimergp <jaimergp@users.noreply.github.com>
Added a commit but couldn't push it here. |
Looks good, @Hind-M. Maybe you can make a PR against my branch (https://github.com/wolfv/ceps/tree/oci-cep). I can then merge it there, and the PR will be updated. |
The rendered version for convenience: https://github.com/wolfv/ceps/blob/oci-cep/cep-oci.md |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One question.
cep-oci.md
Outdated
#### Implementation (conda / mamba / rattler) | ||
|
||
##### mamba | ||
|
||
In order to fetch packages from an OCI registry, we need to set a mirror (can be more than one) for the channel to be used (e.g `conda-forge`). | ||
This can be done in the rc file as follows: | ||
|
||
``` | ||
mirrored_channels: | ||
conda-forge: ["oci://ghcr.io/channel-mirrors/conda-forge"] | ||
``` | ||
|
||
When a user requests installing a package (with the configuration set above, and using `conda-forge` channel), a set of requests to fetch `repodata.json` are first performed as follows: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are we marking these as mirrored instead making them first-class things via using the normal channels config setting with an oci://
prefix to mark OCI channels?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this is orthogonal - you should be able to use the oci://
URL as your only channel as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then why do we need the mirrored_channels
option at all?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe not in this CEP? Specifying how we expect channel mirroring works could have merits. Specifically, we should define where the repodata.json
file comes from when we have multiple mirrors.
Ideally we would also define how we combine download counts from multiple mirrors (but all that would definitely belong into a different CEP).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah we should handle mirroring in another CEP I would think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes it's orthogonal. I changed it to not mention mirrors anymore.
@conda/steering-council This vote falls under the "Enhancement Proposal Approval" policy of the conda governance policy, If you have questions concerning the proposal, you may also leave a comment or code review. It needs 60% of the Steering Council to vote yes to pass. This vote will end on 2024-09-02, End of Day, Anywhere on Earth (AoE). To vote, please use the form below: @xhochy (Uwe Korn)
@CJ-Wright (Christopher J. 'CJ' Wright)
@mariusvniekerk (Marius van Niekerk)
@goanpeca (Gonzalo Peña-Castellanos)
@chenghlee (Cheng H. Lee)
@ocefpaf (Filipe Fernandes)
@marcelotrevisani (Marcelo Duarte Trevisani)
@msarahan (Michael Sarahan)
@mbargull (Marcel Bargull)
@jakirkham (John Kirkham)
@jezdez (Jannis Leidel)
@wolfv (Wolf Vollprecht)
@jaimergp (Jaime Rodríguez-Guerra)
@kkraus14 (Keith Kraus)
@baszalmstra (Bas Zalmstra)
|
cep-oci.md
Outdated
<table> | ||
<tr><td> Title </td><td> OCI registries as conda channels </td> | ||
<tr><td> Status </td><td> Proposed </td></tr> | ||
<tr><td> Author(s) </td><td> Wolf Vollprecht <wolf@prefix.dev></td></tr> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Hind-M Could you please add yourself to the author list?
cep-oci.md
Outdated
For `jlap`, the following mediaType is used: | ||
|
||
- `application/vnd.conda.jlap.v1` | ||
|
||
The `jlap` file should also be stored under the `<channel>/<subdir>/repodata.json` path as an additional layer. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since the JLAP proposal has not been accepted, and sharded repodata approved instead, it may not be needed to keep this in here.
> [!NOTE] | ||
> **Package names in the conda world** | ||
> | ||
> The following regex is given by [`conda/schemas`](https://github.com/conda/schemas/blob/473708ac97283708d6664cbd89b8049ad1623489/common-1.schema.json#L58-L82) for a valid package name: `^[a-z0-9_](?!_)[._-]?([a-z0-9]+(\.|-|_|$))*$` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be good to not rely on conda/schema alone, and verify with the conda code base about the regex, in case the schema isn't up-to-date.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All of these schemas are probably wrong in some way.
|
||
The regex expresses that names can only start with an alphanumeric letter. | ||
|
||
In `conda`, names can start with an underscore and it is used by conda-forge (e.g. `_libgcc_mutex`). For this reason, we prepend packages with a leading underscore with the string `zzz`. The name would thus be changed to `zzz_libgcc_mutex`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In `conda`, names can start with an underscore and it is used by conda-forge (e.g. `_libgcc_mutex`). For this reason, we prepend packages with a leading underscore with the string `zzz`. The name would thus be changed to `zzz_libgcc_mutex`. | |
In `conda`, names can start with an underscore and it is used by conda-forge (e.g. `_libgcc_mutex`). For this reason, we prepend packages with a leading underscore with the string `internal`. The name would thus be changed to `internal_libgcc_mutex`. |
internal
seems more fitting than zzz
💤
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm afraid we might be late here. ~74 packages are mirrored like this already. We would need to delete and re-mirror? Or can can we just copy images and labels?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, how can we safely remove internal_
from the names? It might be something being used already. What about internal___
(triple underscore)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See my comment below on making this not ambiguous by merging this part of the spec with other encodings.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm afraid we might be late here. ~74 packages are mirrored like this already. We would need to delete and re-mirror? Or can can we just copy images and labels?
The mirrors are proofs of concept from the conda community's perspective (I think from conda-forge as well), so I don't think these matter. Re-mirroring 74 packages seems a small price to pay for a better spec.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
100% agreed @jezdez readding 74 packages is simple enough.
- `+` is replaced by `__p__` | ||
- `!` is replaced by `__e__` | ||
- `=` is replaced by `__eq__` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That list seems to be a little random, it looks like Python dunder methods, but not quite valid, how about this instead?
- `+` is replaced by `__p__` | |
- `!` is replaced by `__e__` | |
- `=` is replaced by `__eq__` | |
- `+` is replaced by `__add__` | |
- `!` is replaced by `__not__` | |
- `=` is replaced by `__eq__` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
p
for plus
and e
for exclamation
, I guess? These characters might not mean "adding" or "negating" in a version/build string, so I'm not sure whether this is a good way to solve the "randomness" here. These names are only for the OCI names, anyway, right? They are renamed once downloaded.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
p
forplus
ande
forexclamation
, I guess?
Yes!
These mappings (prepending the package names starting with _
with zzz
and replacing +
etc for tags) are really only intended to make them valid in order to be stored in an OCI registry (correct me if I'm wrong @wolfv). It doesn't impact anything elsewhere as the names and tags are still visible (by users when downloaded for example) as their original strings.
So I don't think having internal
instead of zzz
(same for the tags suggestions) would really matter (apart from making sure that it wouldn't conflict with potential similar existing strings).
While something more explicit would be preferable, given that a bunch of packages have already been mirrored that way as @jaimergp mentioned, I would say to keep it as is?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a standard encoding scheme we can rely on here instead of inventing our own?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, if we insist of a special encoding, we can add underscores themselves as __under__
, which would render our scheme unambiguous, since an existing package with __under__
in it would be mapped to __under____under__under__under____under__
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are conda package names case sensitive?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Based on your findings do you want to change anything with regard to the name mangling suggested above?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Conda package names are always lower case, so presumably case insensitive.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@baszalmstra we might as well have everything lowercase since that is what the OCI registry will look like.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Case insensitive filesystems are usually case preserving though.
|
||
A given conda-package is identified by a URL like `<subdir>/<package-name>-<version>-<build>.<ext>` where `<subdir>` is the platform and architecture, `<package-name>` is the name of the package, `<version>` is the version of the package, `<build>` is the build string of the package, and `<ext>` is the extension of the package file. | ||
|
||
#### Mapping the package name |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are these "mappings" reverted to their original names once downloaded/extracted? I would assume so and I think that should be part of the CEP.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a note for that.
I am emeritus so I cannot vote here, but IMHO the issues around encoding/decoding above need to be resolved before this spec can be passed. |
It seems like there is a fair amount of discussion still happening here for an active vote. Should we cancel the vote and reschedule after that discussion has reached a conclusion? |
|
||
The regex expresses that names can only start with an alphanumeric letter. | ||
|
||
In `conda`, names can start with an underscore and it is used by conda-forge (e.g. `_libgcc_mutex`). For this reason, we prepend packages with a leading underscore with the string `zzz`. The name would thus be changed to `zzz_libgcc_mutex`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens if there is also a package called zzz_libgcc_mutex?
This seems like a possible attack vector.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is similar to symbol name mangling I guess? Perhaps we should prefix all packages with a common prefix and a special one for underscores?
- `+` is replaced by `__p__` | ||
- `!` is replaced by `__e__` | ||
- `=` is replaced by `__eq__` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a standard encoding scheme we can rely on here instead of inventing our own?
The text itself mentiones that package names can start with an underscore. _libgcc_mutex is the example given.
Perhaps we can learn from symbol name mangling instead of coming up with a unique scheme?
I think we can wait and see if we reach a conclusion before the voting deadline, and possibly extend it if necessary. If not, we would definitely reschedule yes. |
I like this proposal by @baszalmstra, but we will need to remirror the whole thing. Another alternative, as proposed by @wolfv, is to SHA256 encode the names and forget about it. You can map back by checking the internal metadata. This will also require a complete remirror. But we can also simply SHA256 encode the name of container images that start with an underscore and leave everything else untouched. This wouldn't require a full remirror. |
I think the name length limit is the bigger issue here though. I guess we'll need a fixed length hash of the name. |
Let me clarify some things: Name attacks are not really possible (for now) since we also mirror the repodata from conda-forge / Anaconda.org. From the repodata, we just use the SHA256 hash to directly reference the right blob. The names & tags for the individual packages are mainly "cosmetic". Even if we would run our own indexing, we would refer back to the stored But I can sympathize with a mapping that would disallow any such overlaps. I also think that re-mirroring might not be a big issue, since we just need to move the names / tags in the OCI registry to the right places (SHA hashes and packages will stay the same). |
So just to wrap up here:
Should we think of handling exceeding the max limit of characters as well? (I'm not sure if this is something already taken care of). |
We CANNOT only prepend to packages that start with an underscore. This would declare part of the package namespace off limits for all conda users. That action requires at minimum a separate CEP where we codify package names formally. We have to preprend to everything. And yes we need to handle the max characters limit properly in this CEP. |
Where is the specification for the length limit? |
Tags have a limit of 128 characters: https://docs.docker.com/reference/cli/docker/image/tag/ It is hard to know if people want to browse an OCI registry. For sure github provides searches based on image names. See for example the existing OCI mirror (https://github.com/orgs/channel-mirrors/packages) based on a beta spec. |
FYI, and considering the still ongoing discussions, the vote has been postponed to a later date that is yet to be determined. |
As a user, as long as https://conda-forge.org/packages/ (or https://bioconda.github.io/conda-package_index.html) has all the correct metadata and points to the correct images in the oci registry, I do not mind if the oci image name itself is hashed. Concretely, I would be okay with
which has the format If it's important to keep some of the original package name in the oci image name, another option is to strip away the invalid characters in the names/tags and add the hash next to it:
Something like this:
The stripped package names and tags can also be trimmed to make sure they are under the character limits. |
Fill in the technical details of the OCI registry storage for reference.