What would you like to be added:
The user should be able to configure with more nuance:
how to detect when two packages should be merged
how to persist package details for merged packages
For instance, today you can assume that any pkg.Package.Metadata is a single struct that represents details for a single package. However, when packages are merged today, these structs MUST be identical. The proposal is to relax this requirement so that the user can merge similar packages that have different cataloged data, which requires us to "join" the metadata of the merged packages.
This way, when merging packages with fuzzier logic, we can still keep all information (and try not to drop anything).
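To make the "keep all information" idea concrete, here is a minimal sketch of one way metadata could be joined. Everything in it is an assumption for illustration: the simplified `Package` type, the `Metadata []any` field, and the `joinMetadata` helper are hypothetical and are not syft's actual types or API.

```go
package main

import "fmt"

// Package is a simplified, hypothetical stand-in for pkg.Package.
// Here Metadata is a slice so that one merged package can carry the
// metadata structs of every package it was built from.
type Package struct {
	Name     string
	Version  string
	Metadata []any
}

// Hypothetical metadata structs from two different catalogers.
type PythonMetadata struct{ SitePackagesRoot string }
type BinaryMetadata struct{ Classifier string }

// joinMetadata returns a copy of a that also carries all metadata from b.
// Nothing is dropped: differing structs are kept side by side.
func joinMetadata(a, b Package) Package {
	merged := a
	merged.Metadata = append(append([]any{}, a.Metadata...), b.Metadata...)
	return merged
}

func main() {
	a := Package{Name: "requests", Version: "2.31.0",
		Metadata: []any{PythonMetadata{SitePackagesRoot: "/usr/lib/python3/site-packages"}}}
	b := Package{Name: "requests", Version: "2.31.0",
		Metadata: []any{BinaryMetadata{Classifier: "example-classifier"}}}

	m := joinMetadata(a, b)
	for _, md := range m.Metadata {
		fmt.Printf("%T\n", md)
	}
}
```

The trade-off in this sketch is that consumers can no longer assume a single metadata struct per package, which is exactly the assumption this issue proposes to relax.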
From a user configuration perspective, this could look like so:
package:
  merge:
    # hash location paths when making package IDs
    use-paths: true
    # hash location layer information when making package IDs
    use-layers: false
    # hash package metadata struct when making package IDs
    use-metadata: true
    use-license: true
    use-purl: true
    # use....
... this is probably a bad example; I think a good part of this issue discussion should be about how the user would specify this.
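As a rough sketch of how options like these could feed into package IDs, the example below hashes only the fields that the configuration enables. The `MergeConfig`, `Location`, and `Package` types and the `fingerprint` function are hypothetical simplifications (metadata is reduced to a string here); this is not how syft computes IDs today.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
	"strings"
)

// MergeConfig mirrors the hypothetical config keys from the example above.
type MergeConfig struct {
	UsePaths, UseLayers, UseMetadata, UseLicense, UsePURL bool
}

// Location and Package are simplified stand-ins, not syft's real types.
type Location struct{ Path, Layer string }

type Package struct {
	Name, Version, Type string
	Locations           []Location
	Metadata            string // assumption: metadata already serialized
	License, PURL       string
}

// fingerprint hashes the core fields plus whatever the config enables.
// Packages with equal fingerprints would be candidates for merging.
func fingerprint(p Package, cfg MergeConfig) string {
	parts := []string{"name:" + p.Name, "version:" + p.Version, "type:" + p.Type}
	if cfg.UsePaths || cfg.UseLayers {
		var locs []string
		for _, l := range p.Locations {
			loc := "loc"
			if cfg.UsePaths {
				loc += "|path:" + l.Path
			}
			if cfg.UseLayers {
				loc += "|layer:" + l.Layer
			}
			locs = append(locs, loc)
		}
		sort.Strings(locs) // make the hash independent of location order
		parts = append(parts, locs...)
	}
	if cfg.UseMetadata {
		parts = append(parts, "metadata:"+p.Metadata)
	}
	if cfg.UseLicense {
		parts = append(parts, "license:"+p.License)
	}
	if cfg.UsePURL {
		parts = append(parts, "purl:"+p.PURL)
	}
	sum := sha256.Sum256([]byte(strings.Join(parts, "\x00")))
	return hex.EncodeToString(sum[:])
}

func main() {
	a := Package{Name: "lodash", Version: "4.17.21", Type: "npm",
		Locations: []Location{{Path: "/app/node_modules/lodash", Layer: "sha256:aaa"}}}
	b := a
	b.Locations = []Location{{Path: "/app/node_modules/lodash", Layer: "sha256:bbb"}}

	cfg := MergeConfig{UsePaths: true, UseLayers: false}
	fmt.Println("same ID:", fingerprint(a, cfg) == fingerprint(b, cfg))
}
```

With `UsePaths: true` and `UseLayers: false`, this sketch treats the same package in the same path but in different layers as one package, while the same package in a different path stays distinct.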
Another way to do this is to make a single flag instead of a config for this:
syft --merge-pkg ENUMVALUE
(handwaving at the enum values now)
It's very possible that we should have specific heuristics for specific cataloging cases, so that we don't expose these low-level options directly but instead let the user allow or deny a list of heuristics.
Another variant: specify a list of package types to merge:
# be more aggressive on deduplication logic for packages of the given types
# note: this does not merge across types, so you can still only merge python packages with other python packages.
syft --merge-pkg-types python,ruby,binary,rpm
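A small sketch of what type-scoped merging could mean in practice, assuming the flag value is a comma-separated list as shown above. The `Package` type, `parseMergeTypes`, and `dedupe` are hypothetical; a real implementation would presumably retain the locations of merged packages rather than keep only the first one seen.

```go
package main

import (
	"fmt"
	"strings"
)

// Package is a simplified, hypothetical stand-in for pkg.Package.
type Package struct {
	Name, Version, Type, Path string
}

// parseMergeTypes turns a value like "python,ruby" into a lookup set.
func parseMergeTypes(flagValue string) map[string]bool {
	types := make(map[string]bool)
	for _, t := range strings.Split(flagValue, ",") {
		if t = strings.TrimSpace(t); t != "" {
			types[t] = true
		}
	}
	return types
}

// dedupe ignores the path for the listed types (more aggressive merging)
// and keeps the path in the key for all other types. The type is always
// part of the key, so packages are never merged across types.
func dedupe(pkgs []Package, mergeTypes map[string]bool) []Package {
	seen := make(map[string]bool)
	var out []Package
	for _, p := range pkgs {
		key := p.Type + "\x00" + p.Name + "\x00" + p.Version
		if !mergeTypes[p.Type] {
			key += "\x00" + p.Path
		}
		if seen[key] {
			continue
		}
		seen[key] = true
		out = append(out, p)
	}
	return out
}

func main() {
	pkgs := []Package{
		{Name: "requests", Version: "2.31.0", Type: "python", Path: "/srv/a/site-packages"},
		{Name: "requests", Version: "2.31.0", Type: "python", Path: "/srv/b/site-packages"},
		{Name: "lodash", Version: "4.17.21", Type: "npm", Path: "/srv/a/node_modules"},
		{Name: "lodash", Version: "4.17.21", Type: "npm", Path: "/srv/b/node_modules"},
	}
	for _, p := range dedupe(pkgs, parseMergeTypes("python,ruby,binary,rpm")) {
		fmt.Println(p.Type, p.Name, p.Path)
	}
}
```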
Why is this needed:
Today syft makes a few assumptions about what makes a distinct package:
packages with the exact same core + metadata information in different project trees (different paths) are considered to be distinct
packages with the same core + metadata information in the same path but different layers are considered to NOT be distinct
This allows us to produce SBOMs with the "maximum resolution" so to speak -- you can answer questions about separate project trees (since they are not merged). However, it also has the downside of producing potentially large SBOMs and package graphs even when the packages logically belong to the same application. Merging dependency trees may be ideal for a user's use case. We have a few detailed examples of these:
This hints that we should make this behavior configurable.
Additional context:
Today we have deduplication of OS and binary packages, which is one of the only cases of cross-package-type deduplication behavior (by dropping). How would this be affected by the proposed configuration?
We could start adding confidence values to package detections, so that merging operations on single-valued fields have a clearer precedence (which field overrides the other).
Should pkg.Package have a shadowPackages slice recording which packages were subsumed?
Relationships also need to be reconciled (merging package graphs).
This might be made easier with an SBOMWriter interface (which takes a set of packages and relationships to write into an SBOM object). The same idea came up in the context of spooling results to disk.
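The ideas in these comments can be sketched together: a confidence value decides which package survives a merge, the subsumed package is kept as a shadow package, and relationships are re-pointed at the survivor. All names here (`Confidence`, `ShadowPackages`, `merge`, `reconcile`) are hypothetical, and confidence is applied per package rather than per field to keep the example short.

```go
package main

import "fmt"

type ID string

// Package is a simplified, hypothetical stand-in for pkg.Package.
type Package struct {
	ID             ID
	Name           string
	Version        string
	Confidence     float64   // hypothetical detection confidence in [0, 1]
	ShadowPackages []Package // hypothetical: packages subsumed by this one
}

// Relationship is a simplified edge in the package graph.
type Relationship struct {
	From, To ID
	Type     string
}

// merge keeps the higher-confidence package (ties favor a) and records
// the other one as a shadow package so its details are not lost.
func merge(a, b Package) (kept Package, dropped ID) {
	if b.Confidence > a.Confidence {
		a, b = b, a
	}
	shadows := append([]Package{}, a.ShadowPackages...)
	a.ShadowPackages = append(shadows, b)
	return a, b.ID
}

// reconcile re-points relationships from dropped IDs to surviving IDs,
// then removes self-references and duplicates created by the merge.
func reconcile(rels []Relationship, replaced map[ID]ID) []Relationship {
	seen := make(map[Relationship]bool)
	var out []Relationship
	for _, r := range rels {
		if to, ok := replaced[r.From]; ok {
			r.From = to
		}
		if to, ok := replaced[r.To]; ok {
			r.To = to
		}
		if r.From == r.To || seen[r] {
			continue
		}
		seen[r] = true
		out = append(out, r)
	}
	return out
}

func main() {
	bin := Package{ID: "bin-1", Name: "curl", Version: "8.5", Confidence: 0.6}
	rpm := Package{ID: "rpm-1", Name: "curl", Version: "8.5.0", Confidence: 0.9}
	kept, dropped := merge(bin, rpm)

	rels := []Relationship{
		{From: "app", To: "bin-1", Type: "depends-on"},
		{From: "app", To: "rpm-1", Type: "depends-on"},
	}
	for _, r := range reconcile(rels, map[ID]ID{dropped: kept.ID}) {
		fmt.Println(r.From, r.Type, r.To)
	}
	fmt.Println("shadowed:", len(kept.ShadowPackages))
}
```

An SBOMWriter-style interface could apply a step like `reconcile` at the point where packages and relationships are handed over together, which is one reason that idea may simplify the merge.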