Add purls (Package URLs) to PackageRecord
#63
base: main
Awesome CEP! :)

## Abstract

This CEP describes a change to the `PackageRecord` format and the corresponding `repodata.json` file to include `purls` (Package Urls) of repackaged packages to identify packages across multiple ecosystems.
Can you add a link to the definition of a `PackageRecord`? I struggle to find an authoritative source for it.
Unfortunately, I believe that at the moment there is no actual "authoritative" source.

- There is this relatively old definition of a `RepoDataRecord`: https://github.com/conda/schemas/blob/main/repodata-record-1.schema.json
- There is this new effort to document the schemas better (conda/schemas#26) where it's also called `RepoDataRecord`: https://github.com/conda/schemas/blob/b143c82a71833570fbe9be2313368b33c0e84726/conda_models/package_record.py#L23
- And we have the definition in rattler: https://docs.rs/rattler_conda_types/latest/rattler_conda_types/struct.PackageRecord.html

In rattler (and I believe in conda as well), there is this distinction:

- `PackageRecord`: contains all the fields for a single entry in the `repodata.json`.
- `RepoDataRecord`: inherits all fields from `PackageRecord` and adds fields to identify the origin of the data (channel, url, etc.).
- `PrefixRecord`: inherits all fields from `RepoDataRecord` and additionally stores information about how the package was installed.
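
The three-level distinction above can be sketched with plain dataclasses. This is only an illustration under stated assumptions: the field selection is heavily reduced, the `files` field and the class layout are hypothetical, and neither conda nor rattler is claimed to define its records this way. The `purls` field is the addition this CEP proposes.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PackageRecord:
    """One entry of repodata.json (only a few illustrative fields)."""
    name: str
    version: str
    build: str
    # Proposed by this CEP; an empty list means "no purls known".
    purls: List[str] = field(default_factory=list)


@dataclass
class RepoDataRecord(PackageRecord):
    """A PackageRecord plus fields identifying where the data came from."""
    channel: str = ""
    url: str = ""


@dataclass
class PrefixRecord(RepoDataRecord):
    """A RepoDataRecord plus information about how it was installed."""
    files: List[str] = field(default_factory=list)  # hypothetical field


record = PrefixRecord(
    name="django",
    version="1.10.1",
    build="py35_0",
    purls=["pkg:pypi/django@1.10.1"],
    channel="conda-forge",
    files=["lib/python3.5/site-packages/django/__init__.py"],
)
```

Because each level extends the previous one, a `purls` field added to `PackageRecord` is visible at every later stage without further changes.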
Yea, I think the most "official" source for this is https://github.com/conda/conda/blob/e783377439ed1c413c6bffb9b785ae1d79c2392a/conda/models/records.py#L247. That module also offers some sort of definition in the top-level docstring.
Implementation of conda/ceps#63

This PR adds support for checking the satisfiability of the lock-file which includes pypi-dependencies. Purls have been added to the lock-file (conda/rattler#414) (See also: conda/ceps#63). This enables checking which conda packages will install which pypi packages without needing to check the internet. This ensures we can still check if a lock-file is up to date quickly. I did not profile this code but I think there are a lot of places we can improve the performance. That's for a later PR. I also didn't add tests. I think we should but we can also do that in another PR. Closes #467

Co-authored-by: Ruben Arts <ruben.arts@hotmail.com>
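
The offline check described above can be sketched as follows. This is not the code from that PR: the record layout (dicts with `name` and `purls` keys) and the function name `pypi_names_provided` are assumptions made for illustration.

```python
from typing import Dict, List


def pypi_names_provided(records: List[dict]) -> Dict[str, str]:
    """Map each PyPI package name to the conda package that provides it.

    Only the `purls` already stored in the records are consulted, so no
    network access is needed.
    """
    provided: Dict[str, str] = {}
    prefix = "pkg:pypi/"
    for record in records:
        for purl in record.get("purls", []):
            if not purl.startswith(prefix):
                continue  # some other ecosystem
            name = purl[len(prefix):]
            # Drop subpath, qualifiers and version, in that order.
            for separator in ("#", "?", "@"):
                name = name.split(separator, 1)[0]
            # Compare PyPI names case-insensitively, with "_" treated as "-".
            provided[name.lower().replace("_", "-")] = record["name"]
    return provided


locked = [
    {"name": "django", "purls": ["pkg:pypi/django@1.10.1"]},
    {"name": "pyyaml", "purls": ["pkg:pypi/PyYAML@6.0"]},
    {"name": "python", "purls": []},
]
provided = pypi_names_provided(locked)
```

A lock-file check could then test whether every requested PyPI dependency is either a key of `provided` or locked as a PyPI package itself.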
}
```

PURL is already supported by dependency-related tooling like SPDX (see [External Repository Identifiers in the SPDX 2.3 spec](https://spdx.github.io/spdx-spec/v2.3/external-repository-identifiers/#f35-purl)), the [Open Source Vulnerability format](https://ossf.github.io/osv-schema/#affectedpackage-field), and the [Sonatype OSS Index](https://ossindex.sonatype.org/doc/coordinates); not having to wait years before support in such tooling arrives is valuable.
I would also mention PEP-725 (WIP).
The Discourse thread has examples showing how the Spack community wants to use this kind of thing: https://discuss.python.org/t/pep-725-specifying-external-dependencies-in-pyproject-toml/31888/31
* We can keep this information close to the conda package description.
* We can incrementally add `purls` through repodata patches.

The downside is that the (already large) repodata.json file will grow.
What if we add a separate-yet-adjacent `purls.json` like we did with `run_exports.json` in CEP-12?
I like the idea and I will be supportive. Having this metadata readily available would allow us to be listed in repology.org, for example! It would also play nicely with the (draft) PEP-725 for external metadata in PyPI.
However, I think this CEP right now is talking about serving metadata before we have discussed how to source it, define it and store it.
Whatever ends up in the repodata.json comes, in part, from the `info/index.json` metadata inside the conda artifact. Then this is augmented with things like sha256 and final size by conda-index (because they cannot be known when the package is being archived).
So before we speak about repodata, we should discuss where in the inner artifact metadata we will store the PURL info. To answer that, we must answer where in the conda-build recipe we will include that information :D
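
As a rough sketch of the flow described above: the entry starts from the `info/index.json` content and is then augmented with values that only exist once the archive is complete. This is not conda-index's actual code; the function name `repodata_entry` is hypothetical, and storing `purls` in `index.json` is exactly the open question, assumed here only so the example has something to carry through.

```python
import hashlib


def repodata_entry(index_json: dict, artifact_bytes: bytes) -> dict:
    """Build a repodata.json entry from index.json plus archive facts."""
    entry = dict(index_json)  # copy, so the input metadata is not mutated
    # These two values cannot be known while the package is being archived.
    entry["sha256"] = hashlib.sha256(artifact_bytes).hexdigest()
    entry["size"] = len(artifact_bytes)
    return entry


index_json = {
    "name": "django",
    "version": "1.10.1",
    "build": "py35_0",
    # Assumption: purls travel inside the artifact metadata.
    "purls": ["pkg:pypi/django@1.10.1"],
}
artifact = b"stand-in for the bytes of the package archive"
entry = repodata_entry(index_json, artifact)
```

Anything that should appear in the served metadata without a repodata patch has to be present in the artifact metadata first, which is why the storage location needs to be settled before the serving format.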
IOW, I'd like to know your thoughts about:
- Where in the current `meta.yaml` we should define the PURLs. `about` seems to be the most obvious one, which means this will probably end up in `info/about.json`.
- Whether to serve the PURLs separately in a `purls.json` or not. I honestly don't think putting it in `repodata.json` is a good idea. I get that it makes sense if you want to have a canonical link between PyPI and conda-forge so Pixi can solve things nicely. It might also be served in `channeldata.json` (since most of the time PURLs are tied to the source, not the platform-dependent target artifact).
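
To make the "separate file" option concrete, here is a minimal sketch that moves `purls` out of a repodata dictionary into a side-car structure. The side-car layout (artifact filename mapped to a list of purls) and the function name `split_purls` are hypothetical; no `purls.json` format has been agreed on in this discussion.

```python
import copy
from typing import Tuple


def split_purls(repodata: dict) -> Tuple[dict, dict]:
    """Return (repodata without purls, side-car mapping filename -> purls)."""
    slim = copy.deepcopy(repodata)  # leave the caller's data untouched
    sidecar = {"packages": {}, "packages.conda": {}}
    for section in ("packages", "packages.conda"):
        for filename, record in slim.get(section, {}).items():
            purls = record.pop("purls", None)
            if purls:
                sidecar[section][filename] = purls
    return slim, sidecar


repodata = {
    "packages": {
        "django-1.10.1-py35_0.tar.bz2": {
            "name": "django",
            "version": "1.10.1",
            "purls": ["pkg:pypi/django@1.10.1"],
        },
        "python-3.5.2-0.tar.bz2": {"name": "python", "version": "3.5.2"},
    },
    "packages.conda": {},
}
slim, sidecar = split_purls(repodata)
```

With this split, a solver that does not care about PURLs downloads only the slim file, while tools that need cross-ecosystem identity fetch the side-car as well.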
Would this also help us address Repology's needs for supporting Conda packages (repology/repology-updater#518)? Edit: Nvm, missed that Jaime has the same idea.
I agree that While this would facilitate simplicity, avoid redundancy, and avoid errors in the recipe, I see the following downsides with that solution:
I do not have a strong opinion here since I am not too involved with the tools that would need to process that data.
I think a broader question is whether
To put this in the context of the above, a given It might make sense to advocate for some changes to the
While I don't think much can be done about "where you got the source tarball" (because GitHub sources, etc), I don't think a recipe author should have to calculate all these things... but certainly could given the available data today:

# meta.yaml
{% set version = "1.10.1" %}
package:
  name: django
  version: {{ version }}
# ...
about:
  # ...
  purls:
    - pkg:pypi/django@{{ version }}
    # this should be fully automated, either at build time (weird?) or trivially-derivable
    - pkg:conda/{{ channel_targets.split(" ")[0] }}/django@1.10.1?subdir={{ target_platform }}&label={{ channel_targets.split(" ")[1] }}&build=py{{ py }}_{{ build_number }}

So the above full purls:
  - pkg:pypi/django@1.10.1
  - pkg:conda/conda-forge/django@1.10.1?subdir=win-32&label=main&build=py35_0
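
To show what a consumer would get out of the two rendered purls above, here is a minimal standard-library parser. It is a sketch, not a complete implementation of the purl specification (it does no per-type normalization), and the function name `parse_purl` is hypothetical.

```python
from urllib.parse import parse_qsl, unquote


def parse_purl(purl: str) -> dict:
    """Split a purl into type, namespace, name, version, qualifiers, subpath."""
    scheme, colon, remainder = purl.partition(":")
    if scheme != "pkg" or not colon:
        raise ValueError(f"not a purl: {purl!r}")
    remainder, _, subpath = remainder.partition("#")
    remainder, _, query = remainder.partition("?")
    # The version follows the last "@"; without one, there is no version.
    path, at, version = remainder.rpartition("@")
    if not at:
        path, version = remainder, ""
    parts = [unquote(part) for part in path.strip("/").split("/")]
    if len(parts) < 2 or not all(parts):
        raise ValueError(f"purl needs at least a type and a name: {purl!r}")
    return {
        "type": parts[0].lower(),
        "namespace": "/".join(parts[1:-1]),
        "name": parts[-1],
        "version": unquote(version),
        "qualifiers": dict(parse_qsl(query)),
        "subpath": subpath,
    }


pypi = parse_purl("pkg:pypi/django@1.10.1")
conda = parse_purl(
    "pkg:conda/conda-forge/django@1.10.1?subdir=win-32&label=main&build=py35_0"
)
```

In the conda example the channel lands in `namespace`, while `subdir`, `label` and `build` arrive as qualifiers, which matches the point that those parts are derivable from build-time data rather than something a recipe author should type.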
Thinking about this more in the context of "accidental cross-ecosystem namesquatting" on zulip: as dependencies:
- pkg:pypi/django >=1.10.1,<1.11

treating everything after the whitespace as "this part is about conda" would still allow for all our variant business, but presumably could eventually be expanded to allow per-ecosystem fields... luckily, pypi only has semi-irrelevant stuff like
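
The "split at the whitespace" rule suggested in this comment can be sketched as below. The function name `split_dependency` and the returned tuple shape are hypothetical, and the constraint is passed through untouched rather than being parsed as a MatchSpec.

```python
from typing import Tuple


def split_dependency(spec: str) -> Tuple[str, str, bool]:
    """Return (identifier, conda constraint, whether identifier is a purl).

    Everything before the first run of whitespace is the package identifier;
    everything after it is left as-is for the conda side to interpret.
    """
    parts = spec.strip().split(None, 1)
    if not parts:
        raise ValueError("empty dependency spec")
    identifier = parts[0]
    constraint = parts[1] if len(parts) > 1 else ""
    return identifier, constraint, identifier.startswith("pkg:")


purl_dep = split_dependency("pkg:pypi/django >=1.10.1,<1.11")
plain_dep = split_dependency("numpy")
```

Since a purl contains no unescaped whitespace, the first space is an unambiguous boundary, and existing name-based specs keep working because they simply lack the `pkg:` prefix.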
I don't follow entirely. What would your example refer to? The PyPI package or the corresponding conda-forge package?
Right, the user wants the corresponding

# e.g. in pixi.toml
[dependencies]
#   | a new package identifier
#   V
"pkg:pypi/django" = ">=1.10.1,<1.11"
#                    ^
#                    | the conda constraints, in the MatchSpec grammar
"pkg:golang/github.com/rhysd/actionlint" = ">=1.7.7"

Where this would be most excellent, for the PyPI case, is if the spec There is no consensus in An extreme case might be

# e.g. in rattler-build recipe.yaml
recipe:
  version: ${{ version }}
outputs:
  # with fully-specified purls
  - package:
      name: fastapi
      purl: pkg:pypi/fastapi@${{ version }}
    dependencies:
      run:
        - pkg:pypi/starlette >=0.40.0,<0.42.0
        - pkg:pypi/pydantic >=1.7.4,!=1.8,!=1.8.1,!=2.0.0,!=2.0.1,!=2.1.0,<3.0.0
        - pkg:pypi/typing-extensions >=4.8.0
  # or maybe it makes sense to CURIE them, using a `pip:`-like syntax
  - package:
      name: fastapi-standard
      purl:
        pkg:pypi:
          - fastapi[standard]@${{ version }}
    dependencies:
      run:
        - ${{ pin_subpackage("fastapi", exact=True) }}
        - pkg:pypi:
            - fastapi-cli[standard] >=0.0.5
            - httpx >=0.23.0
            - jinja2 >=2.11.2
            - python-multipart >=0.0.7
            - itsdangerous >=1.1.0
            - pyyaml >=5.3.1
            - ujson >=4.0.1,!=4.0.2,!=4.1.0,!=4.2.0,!=4.3.0,!=5.0.0,!=5.1.0
            - orjson >=3.2.1
            - email-validator >=2.0.0
            - uvicorn[standard] >=0.12.0
            - pydantic-settings >=2.0.0
            - pydantic-extra-types >=2.0.0

The latter form would all but remove any package-naming impedance, making tools
This CEP describes a change to the `PackageRecord` format and the corresponding `repodata.json` file to include `purls` (Package URLs) of repackaged packages to identify packages across multiple ecosystems.