Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve the performance of the license matching tools #151

Open
goneall opened this issue Jan 26, 2018 · 9 comments
Open

Improve the performance of the license matching tools #151

goneall opened this issue Jan 26, 2018 · 9 comments

Comments

@goneall
Copy link
Member

goneall commented Jan 26, 2018

The method

public static DifferenceDescription isTextStandardLicense(License license, String compareText) throws SpdxCompareException {
is used to compare texts against the SPDX license list.

It currently iterates through all licenses and calls a very expensive comparison method on each license.

The performance can be substantially improved.

For example, run a faster, less precise comparison algorithm and only do the more expensive comparisons if the match is "close enough"

@wking
Copy link
Contributor

wking commented Jan 26, 2018 via email

@goneall
Copy link
Member Author

goneall commented Jan 26, 2018

This would speed up rejections but slow down positive matches.

This particular methods loops through all listed licenses, so for each positive match there will be several hundred negatives, so it will always be a benefit.

There is another method which compares specifically to a single license called by the method which looks at all listed licenses. We could implement a flag in that method or implement in the outer loop.

@avirlrma
Copy link

avirlrma commented Feb 6, 2018

@goneall I am sort of new here.Can you point me to where is the iteration in the method?

@goneall
Copy link
Member Author

goneall commented Feb 6, 2018

@aviral1701 I should have referenced the method with the iteration above. Here's the method with the iteration:

public static String[] matchingStandardLicenseIds(String licenseText) throws InvalidSPDXAnalysisException, SpdxCompareException {

A suggestion on implementing a solution - the isTextStandardLicense method is called inside the loop. It can also be called directly. Improving the performance of the isTextStandardLicense method will have broader.

@goneall
Copy link
Member Author

goneall commented Feb 7, 2018

Looking back on @wking post above, there is a good point:

Currently license-list-XML's match validation is generating a lot more positives

So, perhaps we should move the performance optimization to the outside loop

@avirlrma
Copy link

@goneall updates and queries:
1.) how do you want to go about this? I mean we can implement the intial design to do more rigorous search for only close matches.
2.) Also could you please elaborate on calling the function directly?
3.) Lastly, I have started working on improving the performance of isTextStandardLicense as it will help in any case.

@goneall
Copy link
Member Author

goneall commented Mar 11, 2018

@aviral1701 Sorry about the delayed response - I was traveling and just got back to a location where I can access github without issue.

1.) how do you want to go about this? I mean we can implement the intial design to do more rigorous search for only close matches.

We could change the method matchingStandardLicenseIds to only call isTextStandardLicense for a listed license only if it passes a less strict, much higher performance method first. This would solve this issue raised by @wking above. This would be inside the loop that iterates over the listed licenses.

2.) Also could you please elaborate on calling the function directly?

The method isTextStandardLicense is a public method supported for other use cases. One of the common uses is to confirm text which is stated to be a specific listed license is actually that license. In this use case, it would be less common to have a negative result.

3.) Lastly, I have started working on improving the performance of isTextStandardLicense as it will help in any case.

That is great - any improvements in the method would help in any case

@vlsi
Copy link

vlsi commented Jun 2, 2019

I guess the main culprit here is everybody wants to execute org.spdx.compare.LicenseCompareHelper#matchingStandardLicenseIds (identify license name from a text) multiple times. For instance, if a project has 100 dependencies, then one might want to call the method 100 times.

The problem with LicenseCompareHelper is that parses/tokenizes both license template and given license for each comparison. There are 400-500 license templates, so it results into 2(template and given text)*500(number of licenses to check)*100(number of dependencies) == 100'000 license text tokenizations.

If the tokenization could be avoided, then it would dramatically improve the performance.

@goneall
Copy link
Member Author

goneall commented Jun 3, 2019

@vlsi Good point - there really isn't a need to tokenize the template each time. Perhaps we can change the design to store a tokenized version of the license template which would improve the performance.

We could store the tokenized strings in the license object after it has been initially processed on the first compare and passed into

public CompareTemplateOutputHandler(String compareText) throws IOException {

Anytime the license template text is updated, we would need to re-tokenize the string.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants