Component version to GitHub tag matching. #99

Open
16Martin opened this issue Nov 11, 2024 · 2 comments

Comments

@16Martin
Collaborator

16Martin commented Nov 11, 2024

I have been experiencing issues with capycli's GitHub interaction in bom findsources. Jobs take unexpectedly long, and memory consumption is correspondingly high (though that isn't an issue in itself). I use capycli to process relatively large BOMs and, according to capycli's findings, these frequently contain 400-500 third-party components from GitHub.

I tracked these issues to how findsources maps component versions to tags on GitHub. Currently, capycli first retrieves the full list of a project's tags (get_github_info() in capycli.bom.findsources) and then iterates over this list, hoping to find a match to the version provided as a parameter to get_matching_tag().

There are projects like the tencentcloud sdk with tens of thousands of tags. Using the GitHub API, capycli has to retrieve these in chunks of 100 tags per call using Python's synchronous IO.

On average, get_matching_tag() does 109 negative comparisons for each tag it matches. This means that, on average in my use cases, capycli has to fetch two pages' worth of tags to match a component. This amounts to retrieving tencentcloud sdk alone.
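As a rough check of the two-page figure: with 109 misses before each hit, the matching tag is the 110th one compared, which falls on the second page of 100. The snippet below is only a worked calculation. The 20,000-tag repository is a made-up stand-in for "tens of thousands", not a measured number, and both helper names are hypothetical.

```python
import math

PER_PAGE = 100    # tags per API call, as stated above
AVG_MISSES = 109  # negative comparisons per match, as reported above


def pages_until_match(misses: int, per_page: int = PER_PAGE) -> int:
    """Pages that must be read to reach the tag compared after `misses` misses."""
    return math.ceil((misses + 1) / per_page)


def pages_for_all_tags(total_tags: int, per_page: int = PER_PAGE) -> int:
    """Pages needed to download a repository's complete tag list."""
    return math.ceil(total_tags / per_page)


print(pages_until_match(AVG_MISSES))  # 2
print(pages_for_all_tags(20_000))     # 200 (hypothetical repository size)
```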

As far as I can tell, ...

  • get_github_info() is only ever used twice, with both occurrences in capycli.bom.findsources. Both uses feed almost directly into get_matching_tag().
  • get_matching_tag() is only ever used three times, with all occurrences in capycli.bom.findsources. In all three uses, the result is returned essentially immediately.

Are there any uses of these methods I missed?

@16Martin
Collaborator Author

I am working on a new implementation that replaces get_github_info() and get_matching_tag().

While get_github_info() used to fetch all the tags and get_matching_tag() would search the full result set, the new approach joins these two methods and searches page by page as they are retrieved from the GitHub API.
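To illustrate the page-by-page idea, here is a minimal sketch and not the actual capycli code. It assumes a hypothetical `fetch_page(page)` callable that returns one page of tags (in practice this would wrap one GitHub API call), and it uses a trivial `strip_v()` as a stand-in for to_semver_string(), whose real behaviour is not reproduced here. A fake in-memory repository replaces the network so the early exit is visible.

```python
from typing import Callable, Dict, Iterator, List, Optional

Tag = Dict[str, str]                    # only the "name" key is used here
FetchPage = Callable[[int], List[Tag]]  # hypothetical: page number -> tags on that page


def iter_tags(fetch_page: FetchPage) -> Iterator[Tag]:
    """Yield tags lazily, requesting the next page only when it is needed."""
    page = 1
    while True:
        tags = fetch_page(page)
        if not tags:  # an empty page means there are no more tags
            return
        yield from tags
        page += 1


def find_matching_tag(version: str, fetch_page: FetchPage,
                      to_version: Callable[[str], str]) -> Optional[Tag]:
    """Return the first tag whose name maps to `version`, or None."""
    for tag in iter_tags(fetch_page):
        if to_version(tag["name"]) == version:
            return tag  # stop here; later pages are never requested
    return None


# Fake repository with 250 tags, served in pages of 100 (no network access).
ALL_TAGS = [{"name": f"v1.0.{i}"} for i in range(250)]
requested_pages: List[int] = []


def fake_fetch_page(page: int) -> List[Tag]:
    requested_pages.append(page)
    start = (page - 1) * 100
    return ALL_TAGS[start:start + 100]


def strip_v(name: str) -> str:
    """Stand-in for to_semver_string(); only strips a leading 'v'."""
    return name[1:] if name.startswith("v") else name


hit = find_matching_tag("1.0.105", fake_fetch_page, strip_v)
print(hit, requested_pages)  # {'name': 'v1.0.105'} [1, 2]
```

In this sketch the third page is never requested, because the match sits on the second one. A version with no matching tag still walks every page, so the worst case is unchanged.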

The current implementation is based on the assumption that for each release version there is a corresponding tag in the repository and that the release version is encoded in the tag's name, from which it can be retrieved using to_semver_string(). The new approach takes that assumption even further and tries to guess the correct tag name from a tag that corresponds to a non-matching version.

Based on the assumption that projects follow a scheme for tags that encodes a semantic version (-like) into a tag, I am working on an inverted implementation of to_semver_string(). Instead of inferring a semver from a tag, this inverse implementation will infer a tag from a semver after analysing a tag retrieved from GitHub.
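A minimal sketch of what such an inverse could look like, under strong assumptions: the hypothetical `guess_tag_name()` below treats the single dotted run of numbers in a sample tag as the version and swaps in the wanted version, keeping any prefix and suffix. The regular expression is a deliberate simplification of "semver-like", and none of this is taken from capycli.

```python
import re
from typing import Optional

# Simplified notion of "semver-like": at least two dot-separated numbers.
_VERSION_RE = re.compile(r"\d+(?:\.\d+)+")


def guess_tag_name(sample_tag: str, wanted_version: str) -> Optional[str]:
    """Guess the tag for `wanted_version` from the naming scheme of `sample_tag`.

    Returns None when the sample holds no version-like part, or more than
    one, because the scheme cannot be inferred reliably in those cases.
    """
    matches = list(_VERSION_RE.finditer(sample_tag))
    if len(matches) != 1:
        return None
    m = matches[0]
    # Keep the prefix and suffix of the sample, replace only the version.
    return sample_tag[:m.start()] + wanted_version + sample_tag[m.end():]


print(guess_tag_name("v1.2.3", "2.0.0"))               # v2.0.0
print(guess_tag_name("release-1.2.3-final", "1.4.0"))  # release-1.4.0-final
print(guess_tag_name("nightly", "1.0.0"))              # None
```

A guessed name is only a candidate. It would still have to be confirmed against the repository, for example with a single lookup of that one tag, before being used.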

@tngraf
Collaborator

tngraf commented Nov 14, 2024

Sounds good so far. Please ensure that the old implementation still works as a fallback in case your new, faster approach does not find a match.
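One way to picture the fallback is a thin wrapper that tries the fast path first and runs the old full scan only when the fast path yields nothing. This is a sketch with hypothetical names (`find_with_fallback`, `fast`, `slow`), and it assumes both finders signal "no match" by returning None.

```python
from typing import Callable, Optional

Finder = Callable[[str], Optional[str]]  # version -> tag name, or None if not found


def find_with_fallback(version: str, fast: Finder, slow: Finder) -> Optional[str]:
    """Try the fast finder first; use the slow one only if it finds nothing."""
    tag = fast(version)
    if tag is not None:
        return tag
    return slow(version)


# Toy finders: the fast one only knows a single tag, the slow one knows more.
def fast_finder(version: str) -> Optional[str]:
    return {"1.0.0": "v1.0.0"}.get(version)


def slow_finder(version: str) -> Optional[str]:
    return {"1.0.0": "v1.0.0", "2.0.0": "rel_2_0_0"}.get(version)


print(find_with_fallback("1.0.0", fast_finder, slow_finder))  # v1.0.0
print(find_with_fallback("2.0.0", fast_finder, slow_finder))  # rel_2_0_0
print(find_with_fallback("3.0.0", fast_finder, slow_finder))  # None
```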
