Improve the performance of the license matching tools #151

goneall · 2018-01-26T20:06:06Z

The method

tools/src/org/spdx/compare/LicenseCompareHelper.java

Line 549 in fcdf7d2

    
           public static DifferenceDescription isTextStandardLicense(License license, String compareText) throws SpdxCompareException {

is used to compare texts against the SPDX license list.

It currently iterates through all licenses and calls a very expensive comparison method on each license.

The performance can be substantially improved.

For example, run a faster, less precise comparison algorithm and only do the more expensive comparisons if the match is "close enough"

wking · 2018-01-26T20:35:03Z

On Fri, Jan 26, 2018 at 08:06:07PM +0000, goneall wrote: For example, run a faster, less precise comparison algorithm and only do the more expensive comparisons if the match is "close enough"

This would speed up rejections but slow down positive matches. Currently license-list-XML's match validation is generating a lot more positives (and ideally no negatives ;). So while faster rejections sounds good to me, we may want a flag to turn that logic off for cases where we expect positives.

goneall · 2018-01-26T22:14:17Z

This would speed up rejections but slow down positive matches.

This particular methods loops through all listed licenses, so for each positive match there will be several hundred negatives, so it will always be a benefit.

There is another method which compares specifically to a single license called by the method which looks at all listed licenses. We could implement a flag in that method or implement in the outer loop.

avirlrma · 2018-02-06T06:16:56Z

@goneall I am sort of new here.Can you point me to where is the iteration in the method?

goneall · 2018-02-06T18:42:46Z

@aviral1701 I should have referenced the method with the iteration above. Here's the method with the iteration:

tools/src/org/spdx/compare/LicenseCompareHelper.java

Line 606 in fcdf7d2

    
           public static String[] matchingStandardLicenseIds(String licenseText) throws InvalidSPDXAnalysisException, SpdxCompareException {

A suggestion on implementing a solution - the isTextStandardLicense method is called inside the loop. It can also be called directly. Improving the performance of the isTextStandardLicense method will have broader.

goneall · 2018-02-07T18:39:17Z

Looking back on @wking post above, there is a good point:

Currently license-list-XML's match validation is generating a lot more positives

So, perhaps we should move the performance optimization to the outside loop

avirlrma · 2018-02-25T04:37:50Z

@goneall updates and queries:
1.) how do you want to go about this? I mean we can implement the intial design to do more rigorous search for only close matches.
2.) Also could you please elaborate on calling the function directly?
3.) Lastly, I have started working on improving the performance of isTextStandardLicense as it will help in any case.

goneall · 2018-03-11T09:10:44Z

@aviral1701 Sorry about the delayed response - I was traveling and just got back to a location where I can access github without issue.

1.) how do you want to go about this? I mean we can implement the intial design to do more rigorous search for only close matches.

We could change the method matchingStandardLicenseIds to only call isTextStandardLicense for a listed license only if it passes a less strict, much higher performance method first. This would solve this issue raised by @wking above. This would be inside the loop that iterates over the listed licenses.

2.) Also could you please elaborate on calling the function directly?

The method isTextStandardLicense is a public method supported for other use cases. One of the common uses is to confirm text which is stated to be a specific listed license is actually that license. In this use case, it would be less common to have a negative result.

3.) Lastly, I have started working on improving the performance of isTextStandardLicense as it will help in any case.

That is great - any improvements in the method would help in any case

vlsi · 2019-06-02T07:23:41Z

I guess the main culprit here is everybody wants to execute org.spdx.compare.LicenseCompareHelper#matchingStandardLicenseIds (identify license name from a text) multiple times. For instance, if a project has 100 dependencies, then one might want to call the method 100 times.

The problem with LicenseCompareHelper is that parses/tokenizes both license template and given license for each comparison. There are 400-500 license templates, so it results into 2(template and given text)*500(number of licenses to check)*100(number of dependencies) == 100'000 license text tokenizations.

If the tokenization could be avoided, then it would dramatically improve the performance.

goneall · 2019-06-03T00:08:35Z

@vlsi Good point - there really isn't a need to tokenize the template each time. Perhaps we can change the design to store a tokenized version of the license template which would improve the performance.

We could store the tokenized strings in the license object after it has been initially processed on the first compare and passed into

tools/src/org/spdx/compare/CompareTemplateOutputHandler.java

Line 747 in 9554be2

public CompareTemplateOutputHandler(String compareText) throws IOException {

Anytime the license template text is updated, we would need to re-tokenize the string.

goneall added the help wanted label Jan 26, 2018

goneall mentioned this issue Apr 2, 2018

Use the scancode toolkit for check license function spdx/spdx-online-tools#27

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve the performance of the license matching tools #151

Improve the performance of the license matching tools #151

goneall commented Jan 26, 2018

wking commented Jan 26, 2018 via email

goneall commented Jan 26, 2018

avirlrma commented Feb 6, 2018

goneall commented Feb 6, 2018 •

edited

Loading

goneall commented Feb 7, 2018

avirlrma commented Feb 25, 2018

goneall commented Mar 11, 2018

vlsi commented Jun 2, 2019

goneall commented Jun 3, 2019

Improve the performance of the license matching tools #151

Improve the performance of the license matching tools #151

Comments

goneall commented Jan 26, 2018

wking commented Jan 26, 2018 via email

goneall commented Jan 26, 2018

avirlrma commented Feb 6, 2018

goneall commented Feb 6, 2018 • edited Loading

goneall commented Feb 7, 2018

avirlrma commented Feb 25, 2018

goneall commented Mar 11, 2018

vlsi commented Jun 2, 2019

goneall commented Jun 3, 2019

goneall commented Feb 6, 2018 •

edited

Loading