Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

More performant dotnet PE parser #3563

Draft
wants to merge 1 commit into
base: main
Choose a base branch
from
Draft

Conversation

wagoodman
Copy link
Contributor

Description

Today the dotnet PE cataloger reads the entire binary into memory to make it available to the saferwall/pe lib for parsing. This is problematic since that could result in lots of unnecessary allocations.

Take for example scanning the dotnet sdk with dependencies (~14,000 dlls) downloaded; here's the total allocated (alloc_space in pprof) memory by today's cataloger:
Screenshot 2025-01-03 at 1 29 21 PM

This PR adds a PE parser that is only concerned with gathering the version resources used to determine the artifact's name and version. Since this ultimately reads much fewer bytes through the underlying file resolver we allocate much less memory overall:

Screenshot 2025-01-03 at 1 29 30 PM

In this case, this saves about 48GB of memory being allocated overall 🎉.

In terms of wall clock time, the original scan took ~80 seconds, with the adjustments on this branch it takes ~35 seconds 🎉.

This has no impact on unreleased memory (inuse_space in pprof) since in both cases any analyzed material is still released and results in the same number of packages (ish).

One caveats is that I'm still seeing a difference of ~20 packages from the total tested of ~14,000, so there is still something to work out / debug on this branch.

An alternative considered would be to plumb though mmap capabilities from stereoscope through to syft, as well as enhance the directory resolver similarly (mmap files read via that resolver). This is promising but more impactful and will take more planning. PE is a well known format and for our purposes we only need to parse a small portion to gather the raw version information.

Type of change

  • Performance (make Syft run faster or use less memory, without changing visible behavior much)

Checklist:

  • I have added unit tests that cover changed behavior
  • I have tested my code in common scenarios and confirmed there are no regressions
  • I have added comments to my code, particularly in hard-to-understand sections

TODO:

  • Add tests
  • Find out why scanning dotnet sdk misses 20 packages

Signed-off-by: Alex Goodman <wagoodman@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant