-
Notifications
You must be signed in to change notification settings - Fork 248
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optimize cross reference object offset validation by avoiding nested loop #935
Optimize cross reference object offset validation by avoiding nested loop #935
Conversation
@@ -6,7 +6,7 @@ | |||
/// <summary> | |||
/// Used to uniquely identify and refer to objects in the PDF file. | |||
/// </summary> | |||
public readonly struct IndirectReference | |||
public readonly struct IndirectReference : IEquatable<IndirectReference> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seeing that dictionary insertion was the bottleneck initially made me think that Equals boxing might be the issue so I added IEquatable. However, this didn't make a noticeable difference in my particular case. Given how frequently this type is used as a dictionary key I think it is worth keeping (I notice that other PdfPig types do have this).
@@ -49,7 +49,7 @@ internal CrossReferenceTable(CrossReferenceType type, IReadOnlyDictionary<Indire | |||
Trailer = trailer ?? throw new ArgumentNullException(nameof(trailer)); | |||
CrossReferenceOffsets = crossReferenceOffsets ?? throw new ArgumentNullException(nameof(crossReferenceOffsets)); | |||
|
|||
var result = new Dictionary<IndirectReference, long>(); | |||
var result = new Dictionary<IndirectReference, long>(capacity: objectOffsets.Count); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fly-by optimization since we're just copying the dictionary.
var bruteForceOffsets = BruteForceSearcher.GetObjectLocations(bytes); | ||
if (bruteForceOffsets.Count > 0) | ||
{ | ||
var builderOffsets = new Dictionary<IndirectReference, long>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved this inside the if since it wasn't referenced outside
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about doing var builderOffsets = new Dictionary<IndirectReference, long>(bruteForceOffsets.Count);
here too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed & updated!
|
||
foreach (var item in bruteForceOffsets) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moving this loo outside the foreach loop that starts on 33 was the real win. As far as I can tell, the two loops are independent, and we're still maintaining the order such that bruteForceOffsets are added after object offsets and thus could override them in builderOffsets.
I don't see a reason why this needs to be an inner loop.
Thanks for taking a look @BobLd ! |
@madelson thanks a lot for the PR! I've added a comment, let me know what you think. By any chance, is there a way for you to share the problematic pdf (not required, would just be nice for benchmarking)? |
Thanks @BobLd !
Agreed & addressed!
I looked into it and unfortunately it's not one I can share. However, any large PDF that hits this codepath probably reproduces the issue. Not sure if any of those are in the benchmark set already. This was over 300 pages of slides. |
looks great! thanks a lot for the contribution @madelson |
Appreciate the quick engagement @BobLd ! Is there an ETA for this appearing in a published NuGet version? Not sure what the release schedule is like. |
@madelson your changes will be made available as a nuget package overnight (pre released version), so you'll have that by tomorrow morning. The official release is not planned yet, but should be made available before end of year |
I've been using PdfPig for text extraction on various files.
Recently, I found 1 file which was taking ~23s to open. Profiling showed that the hotspot was:
After the changes here, the time to open the document drops to just 3s!