Improve the performance of the license matching tools #151
Comments
On Fri, Jan 26, 2018 at 08:06:07PM +0000, goneall wrote:
> For example, run a faster, less precise comparison algorithm and
> only do the more expensive comparisons if the match is "close
> enough"
This would speed up rejections but slow down positive matches.
Currently license-list-XML's match validation is generating a lot more
positives (and ideally no negatives ;). So while faster rejections
sounds good to me, we may want a flag to turn that logic off for cases
where we expect positives.
This particular method loops through all listed licenses, so for each positive match there will be several hundred negatives; a fast rejection step will therefore always be a benefit here. There is another method, called by the one that looks at all listed licenses, which compares against a single license. We could implement a flag in that method or implement the optimization in the outer loop.
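A minimal sketch of the outer-loop variant, for illustration only. It is not the code in `LicenseCompareHelper`: the class `PrefilterSketch`, the word-coverage heuristic, the `usePrefilter` flag, and the `threshold` parameter are all assumptions made for this example, and the expensive comparison is passed in as a plain predicate. The idea is to reject a license cheaply when too few of its template words appear in the text, and to run the expensive comparison only on the survivors.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.function.BiPredicate;

final class PrefilterSketch {

    /** Lower-cased set of words in the text; a cheap stand-in for real tokenization. */
    static Set<String> words(String text) {
        Set<String> result = new HashSet<>();
        for (String w : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!w.isEmpty()) {
                result.add(w);
            }
        }
        return result;
    }

    /** Fraction of the template's words that also occur in the candidate text. */
    static double coverage(Set<String> templateWords, Set<String> textWords) {
        if (templateWords.isEmpty()) {
            return 1.0;
        }
        int hits = 0;
        for (String w : templateWords) {
            if (textWords.contains(w)) {
                hits++;
            }
        }
        return (double) hits / templateWords.size();
    }

    /**
     * Returns the ids of all templates the expensive comparison accepts.
     * When usePrefilter is true, templates whose word coverage falls below
     * the threshold are skipped without calling the expensive comparison.
     */
    static List<String> matchingIds(String text,
                                    Map<String, String> templatesById,
                                    BiPredicate<String, String> expensiveCompare,
                                    boolean usePrefilter,
                                    double threshold) {
        Set<String> textWords = words(text); // computed once per input text
        List<String> matches = new ArrayList<>();
        for (Map.Entry<String, String> entry : templatesById.entrySet()) {
            if (usePrefilter
                    && coverage(words(entry.getValue()), textWords) < threshold) {
                continue; // cheap rejection
            }
            if (expensiveCompare.test(text, entry.getValue())) {
                matches.add(entry.getKey());
            }
        }
        return matches;
    }
}
```

Passing `usePrefilter = false` gives callers that expect a positive match a way to skip the extra step. Real templates contain optional and replaceable text, so a word-coverage threshold would need tuning to avoid rejecting true matches; the template word sets could also be precomputed rather than rebuilt on each call.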
@goneall I am sort of new here. Can you point me to where the iteration is in the method?
@aviral1701 I should have referenced the method with the iteration above. Here's the method with the iteration:
A suggestion on implementing a solution - the |
Looking back on @wking post above, there is a good point:
So, perhaps we should move the performance optimization to the outside loop.
@goneall updates and queries: |
@aviral1701 Sorry about the delayed response - I was traveling and just got back to a location where I can access GitHub without issue.
We could change the method
The method
That is great - any improvements in the method would help in any case.
I guess the main culprit here is everybody wants to execute [...] The problem with [...] If the tokenization could be avoided, then it would dramatically improve the performance.
@vlsi Good point - there really isn't a need to tokenize the template each time. Perhaps we can change the design to store a tokenized version of the license template which would improve the performance. We could store the tokenized strings in the license object after it has been initially processed on the first compare and passed into
Any time the license template text is updated, we would need to re-tokenize the string.
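The caching idea could look roughly like the sketch below. It is an illustration under stated assumptions, not the project's design: `TemplateTokenCache`, `tokensFor`, and `invalidate` are hypothetical names, the cache is kept beside the license object rather than inside it, and the tokenizer is supplied by the caller as a function.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

final class TemplateTokenCache {

    private final Map<String, List<String>> tokensById = new ConcurrentHashMap<>();
    private final Function<String, List<String>> tokenizer;

    TemplateTokenCache(Function<String, List<String>> tokenizer) {
        this.tokenizer = tokenizer;
    }

    /**
     * Returns the cached tokens for the license id, tokenizing the template
     * text only on the first request for that id (or after invalidate).
     */
    List<String> tokensFor(String licenseId, String templateText) {
        return tokensById.computeIfAbsent(licenseId, id ->
                Collections.unmodifiableList(
                        new ArrayList<>(tokenizer.apply(templateText))));
    }

    /** Must be called whenever the template text for the license changes. */
    void invalidate(String licenseId) {
        tokensById.remove(licenseId);
    }
}
```

The cached list is unmodifiable so that one comparison cannot corrupt the tokens seen by the next. Keying by license id means a stale entry survives until `invalidate` is called, which is why the template setter would have to call it.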
The method
tools/src/org/spdx/compare/LicenseCompareHelper.java
Line 549 in fcdf7d2
It currently iterates through all licenses and calls a very expensive comparison method on each license.
The performance can be substantially improved.
For example, run a faster, less precise comparison algorithm and only do the more expensive comparisons if the match is "close enough"