Component version to GitHub tag matching. #99

16Martin · 2024-11-11T21:50:32Z

I have been experiencing issues in bom findsources with capycli's GitHub interaction. Jobs take unexpectedly long, and memory consumption is correspondingly high (though that isn't an issue in itself). I use capycli to process relatively large BOMs and, according to capycli's findings, I frequently have 400-500 third-party components from GitHub.

I tracked these issues to how findsources maps component versions to tags on GitHub. Currently, capycli first retrieves the full list of a project's tags (get_github_info() in capycli.bom.findsources) and then iterates over this list, hoping to find a match to the version provided as a parameter to get_matching_tag().

There are projects like the tencentcloud sdk with tens of thousands of tags. Using the GitHub API, capycli has to retrieve these in chunks of 100 tags per call using Python's synchronous I/O.

On average, get_matching_tag() does 109 negative comparisons for each tag it matches. This means that, on average, in my use cases capycli has to fetch two pages' worth of tags to match a component. This amounts to retrieving tencentcloud sdk alone.
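To make the arithmetic above concrete, here is a small sketch. The helper name `requests_needed` and the 30,000-tag repository are illustrative assumptions, not capycli code or a measured figure; only the page size of 100 and the 109 comparisons come from the numbers above.

```python
import math

GITHUB_MAX_PER_PAGE = 100  # chunk size per API call, as mentioned above


def requests_needed(tags_to_read: int, per_page: int = GITHUB_MAX_PER_PAGE) -> int:
    """Number of paginated API calls needed to read `tags_to_read` tags."""
    if tags_to_read <= 0:
        return 0
    return math.ceil(tags_to_read / per_page)


# 109 misses plus the one hit: about 110 tags have to be inspected.
print(requests_needed(109 + 1))  # 2

# Hypothetical repository with 30,000 tags, fetched in full before matching.
print(requests_needed(30_000))  # 300
```

So a match typically needs two requests, while fetching everything up front costs one request per 100 tags regardless of where the match sits.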

As far as I can tell, ...

get_github_info() is only ever used twice, with both occurrences in capycli.bom.findsources. Both uses feed almost directly into get_matching_tag().
get_matching_tag() is only ever used three times, with all occurrences in capycli.bom.findsources. In all of them the result is essentially returned immediately.

Are there any uses of these methods I missed?

16Martin · 2024-11-14T07:08:14Z

I am working on a new implementation that replaces get_github_info() and get_matching_tag().

While get_github_info() used to fetch all the tags and get_matching_tag() would search the full result set, the new approach joins these two methods and searches page by page as they are retrieved from the GitHub API.
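A minimal sketch of the page-by-page idea, not the actual implementation: `fetch_page`, `find_matching_tag` and `normalize_version` are hypothetical names, and `normalize_version` is only a crude stand-in for to_semver_string(). The page fetcher is passed in as a callable so the search logic can be exercised without network access; it is assumed to return a list of dicts with a "name" key, and an empty list once the pages run out.

```python
from typing import Callable, Dict, List, Optional

Tag = Dict[str, str]


def normalize_version(name: str) -> str:
    """Crude stand-in for to_semver_string(): drop a single leading 'v'."""
    return name[1:] if name[:1] in ("v", "V") else name


def find_matching_tag(
    fetch_page: Callable[[int], List[Tag]],
    version: str,
    normalize: Callable[[str], str] = normalize_version,
) -> Optional[Tag]:
    """Search tags one page at a time and stop at the first match."""
    page = 1
    while True:
        tags = fetch_page(page)  # one API call per page
        if not tags:             # an empty page means there are no more tags
            return None
        for tag in tags:
            if normalize(tag.get("name", "")) == version:
                return tag       # early exit: later pages are never requested
        page += 1


# Usage with an in-memory fake instead of the GitHub API.
fake_pages = {1: [{"name": "v3.0.0"}, {"name": "v2.0.0"}], 2: [{"name": "v1.0.0"}]}
print(find_matching_tag(lambda p: fake_pages.get(p, []), "2.0.0"))
```

With this shape, a version found on the first page costs a single request, and only a miss walks through every page.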

The current implementation is based on the assumption that for each release version there is a corresponding tag in the repository and that the release version is encoded in the tag's name, from which it can be retrieved using to_semver_string(). The new approach builds on that assumption even further and tries to guess the correct tag name from a tag that corresponds to a non-matching version.

Based on the assumption that projects follow a scheme for tags that encodes a semantic(-like) version into a tag, the new approach uses an inverted implementation of to_semver_string(). Instead of inferring a semver from a tag, this inverse implementation will infer a tag from a semver after analysing a tag retrieved from GitHub.
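One way such an inverse could look is sketched below. `guess_tag_name` is a hypothetical helper, not the code under development; it assumes the version is the longest dotted or underscored run of numbers in the sample tag and that everything around it is a fixed prefix and suffix.

```python
import re
from typing import Optional

# A run of numbers joined by '.' or '_', e.g. "1.2.3" or "1_2_3".
_VERSION_RE = re.compile(r"\d+(?:[._]\d+)*")


def guess_tag_name(sample_tag: str, wanted_version: str) -> Optional[str]:
    """Guess the tag for `wanted_version` from the shape of `sample_tag`."""
    # Take the longest numeric run so that names like "log4j2-2.17.0" work.
    match = max(
        _VERSION_RE.finditer(sample_tag),
        key=lambda m: len(m.group(0)),
        default=None,
    )
    if match is None:
        return None  # the sample tag carries no recognisable version
    # Reuse the separator style found in the sample tag.
    separator = "_" if "_" in match.group(0) else "."
    rendered = wanted_version.replace(".", separator)
    return sample_tag[: match.start()] + rendered + sample_tag[match.end():]


print(guess_tag_name("v1.2.3", "2.0.1"))           # v2.0.1
print(guess_tag_name("release_1_2_3", "2.0.1"))    # release_2_0_1
print(guess_tag_name("log4j2-2.17.0", "2.17.1"))   # log4j2-2.17.1
```

The guessed name is only a candidate: it still has to be checked against the repository, and projects that changed their tag scheme between releases would defeat the guess.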

tngraf · 2024-11-14T17:56:03Z

Sounds good so far. Please ensure that the old implementation still works as a fallback in case your new, faster approach does not work.
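A rough sketch of what such a fallback could look like, assuming both strategies are wrapped as callables that return a tag or None. The names `find_tag_with_fallback`, `fast_lookup` and `full_lookup` are hypothetical and do not exist in capycli.

```python
from typing import Callable, Optional

Lookup = Callable[[str], Optional[str]]


def find_tag_with_fallback(version: str, fast_lookup: Lookup, full_lookup: Lookup) -> Optional[str]:
    """Try the fast strategy first; use the exhaustive one only if it finds nothing."""
    tag = fast_lookup(version)
    if tag is not None:
        return tag
    # Old behaviour: search the complete tag list.
    return full_lookup(version)


# Usage with stand-in strategies: the fast path misses, the old path still answers.
print(find_tag_with_fallback("1.0.0", lambda v: None, lambda v: "v" + v))
```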

This was referenced Nov 16, 2024:

Draft: FindSources w/o always fetching all tags #102 (Closed)

GitHub tag matching #103 (Open)
