
Squashed all layers #3138

Open · wants to merge 6 commits into main
Conversation

@tomersein (Contributor) commented Aug 20, 2024

This PR tries to solve the squash-with-all-layers resolver issue, aligned with the newest version of syft.
Please let me know how to proceed further; the solution here may not be perfect, but it does know how to handle deleted packages.

Part of #15

@dbrugman commented

@tomersein - I know very little about the Syft internals, and I'm trying to understand this PR. From the code and comments I understand that the new option will catalog packages from all layers, but then only include packages that are visible in the squashed file-system. How is that different from the regular squashed scope (or, I could probably rephrase this to: what is the difference between 'cataloging' and 'including')?

My main concern is whether this would (eventually) help to fix issue #1818

Many thanks!

@tomersein (Contributor, Author) commented

Hi @dbrugman,
In this PR I am trying to display only packages that exist in the squashed layer and, when they do, to include all of the layers they exist in, so we can track down the layer in which they were added.

@dbrugman commented

Got it, thanks @tomersein

@kzantow (Contributor) commented Aug 29, 2024

Hi @tomersein -- thanks for the contribution. I don't think we would want to merge this as-is, though. I wonder whether there are other things we could do to help you accomplish what you're hoping to achieve.

So that I understand correctly: the use case is to be able to find the layer that introduced a package, right?

@tomersein (Contributor, Author) commented Aug 29, 2024

Yes, correct @kzantow -- let me know what the gaps are so I can push some fixes/improvements.
I want to add some more information following your meeting yesterday:

  • The advantage of this solution is that you only need to scan once. When end users want to see the vulnerabilities in their container, all-layers output can confuse them, since some of those packages no longer exist.
  • This solution can help users fix their vulnerabilities by updating the layer the vulnerability was introduced in.

@kzantow - please see my notes after the meeting yesterday.
@wagoodman I am available to do some fixes if needed, just let me know :)

@TimBrown1611 commented

Any update? :) @wagoodman

@tomersein (Contributor, Author) commented

Did some static-analysis corrections; all checks now pass.
@kzantow @wagoodman

@wagoodman (Contributor) commented

@tomersein thank you for submitting a candidate solution to the problem of tracking layer-of-first-attribution.

Let me first summarize how this PR achieves attribution. The first change adds a new file Resolver, which makes use of the squashed resolver and the all-layers resolver depending on the use case. The second change adds IsSquashedAllLayersResolver and IsSquashedLayer bools to the core LocationMetadata struct; the new file resolver raises locations that positively confirm whether a location is from the squashed layer and whether the new resolver is in use. The last change updates the syft-json formatter to filter out all packages that have no locations from the squashed layer. This, in combination with the existing deduplication logic, would yield the same number of packages found, but with additional layer attributions.
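
To make the shape of this concrete, here's a rough sketch (in Go, with names paraphrased from the PR description rather than copied from the actual code) of the mechanism as I read it:

// Illustrative sketch only -- names paraphrase the PR's described changes,
// not the exact syft source.
package sketch

// LocationMetadata carries the two new resolver-aware flags described above.
type LocationMetadata struct {
    LayerID                     string
    IsSquashedLayer             bool // visible in the squashed filesystem
    IsSquashedAllLayersResolver bool // produced by the new combined resolver
}

type Package struct {
    Name      string
    Locations []LocationMetadata
}

// filterPackages mirrors the syft-json formatter change: when the combined
// resolver was used, keep a package only if at least one of its locations is
// in the squashed layer, while retaining all of its layer attributions.
func filterPackages(pkgs []Package) []Package {
    var kept []Package
    for _, p := range pkgs {
        visible, newResolver := false, false
        for _, loc := range p.Locations {
            visible = visible || loc.IsSquashedLayer
            newResolver = newResolver || loc.IsSquashedAllLayersResolver
        }
        // only filter when the new resolver produced the locations;
        // otherwise behave as before
        if visible || !newResolver {
            kept = append(kept, p)
        }
    }
    return kept
}

The squashed-visibility check is what keeps the package count identical to a squashed scan, while the extra locations carry the layer attribution.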

Take for example a (rather silly) Dockerfile:

FROM ubuntu:latest
RUN apt update -y
RUN apt install -y jq
RUN apt install -y vim
RUN apt install -y wget curl
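
For reference, assuming the Dockerfile above is in the current directory, the image used in the examples below can be produced with something like:

docker build -t localhost/squash-all-layers .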

And after build:

$ docker inspect localhost/squash-all-layers | jq '.[].RootFS.Layers'
[
  "sha256:c26ee3582dcbad8dc56066358653080b606b054382320eb0b869a2cb4ff1b98b",
  "sha256:5ba46f5cab5074e141556c87b924bc3944507b12f3cd0f71c5b0aa3982fb3cd4",
  "sha256:1fde57bfea7ecd80e6acc2c12d90890d32b7977fec17495446678eb16604d8c7",
  "sha256:9b6721789a2d1e4cef4a8c3cc378580eb5df938b90befefba0b55e07b54f0c33",
  "sha256:4097f47ebf86f581c9adc3c46b7dc9f2a27db5c571175c066377d0cef9995756"
]

Here we'll have multiple copies of the DPKG status file, which means that classically we'll use the last layer for all package evidence locations (at least when it comes to the primary evidence location for the status file).

Let's take a look at just vim's locations:

cat /tmp/after.json | jq '.artifacts[] | select(.name == "vim").locations'
[
  {
    "path": "/usr/share/doc/vim/copyright",
    "layerID": "sha256:9b6721789a2d1e4cef4a8c3cc378580eb5df938b90befefba0b55e07b54f0c33",
    "accessPath": "/usr/share/doc/vim/copyright",
    "annotations": {
      "evidence": "supporting"
    }
  },
  {
    "path": "/var/lib/dpkg/info/vim.md5sums",
    "layerID": "sha256:9b6721789a2d1e4cef4a8c3cc378580eb5df938b90befefba0b55e07b54f0c33",
    "accessPath": "/var/lib/dpkg/info/vim.md5sums",
    "annotations": {
      "evidence": "supporting"
    }
  },
  {
    "path": "/var/lib/dpkg/status",
    "layerID": "sha256:4097f47ebf86f581c9adc3c46b7dc9f2a27db5c571175c066377d0cef9995756",
    "accessPath": "/var/lib/dpkg/status",
    "annotations": {
      "evidence": "primary"
    }
  },
  {
    "path": "/var/lib/dpkg/status",
    "layerID": "sha256:9b6721789a2d1e4cef4a8c3cc378580eb5df938b90befefba0b55e07b54f0c33",
    "accessPath": "/var/lib/dpkg/status",
    "annotations": {
      "evidence": "primary"
    }
  }
]

Note that we see the original layer the package was added in (sha256:9b6...c33) as well as the final, unrelated modification of the status file (sha256:409...756). This is great! This is exactly what we're looking for in terms of results. There might be some debate around including one and only one spot for primary evidence, but let's ignore that for now.

Here's what I see when running a before and after:

❯ syft localhost/squash-all-layers:latest -o table=/dev/null
 ✔ Loaded image                                                                                                                                                           localhost/squash-all-layers:latest
 ✔ Parsed image                                                                                                                      sha256:6a78fd79097acadb77e57cd1c32fca596c3addc3d99c77e4fc977032a2ab3eb2
 ✔ Cataloged contents                                                                                                                       6608654972fcc7d28136e3ecffca4bfe371d89f0737cef299bdf378c87146bcf
   ├── ✔ Packages                        [132 packages]
   ├── ✔ File metadata                   [5,418 locations]
   ├── ✔ Executables                     [809 executables]
   └── ✔ File digests                    [5,418 files]

❯ syft localhost/squash-all-layers:latest -s squash-with-all-layers -o table=/dev/null
 ✔ Loaded image                                                                                                                                                           localhost/squash-all-layers:latest
 ✔ Parsed image                                                                                                                      sha256:6a78fd79097acadb77e57cd1c32fca596c3addc3d99c77e4fc977032a2ab3eb2
 ✔ Cataloged contents                                                                                                                       6608654972fcc7d28136e3ecffca4bfe371d89f0737cef299bdf378c87146bcf
   ├── ✔ Packages                        [132 packages]
   ├── ✔ File metadata                   [40,226 locations]
   ├── ✔ Executables                     [1,618 executables]
   └── ✔ File digests                    [40,226 files]

It looks like ~138 packages were found during cataloging, and before finalizing the number dropped to ~132, so that's good.

But I noticed these runs took different times -- 8 seconds vs 11 seconds. Not a big difference, but given that this is a small and simple image, it is worth looking at. I believe this is because we're essentially doing both a squashed scan and an all-layers scan implicitly, since the resolver returns all references from both resolvers (and does not deduplicate file.Locations, by the way). This isn't a problem in and of itself, since any approach might need to do just this, but I think it explains the mechanics of what's happening time-wise.

Also note that there are several more executables and files cataloged! This is concerning, since this should behave no differently than the squashed cataloger from a count perspective. It's not immediately apparent what is causing this, but it is a large blocker for this change (at first glance I think it's because catalogers are creating duplicate packages and relationships, but only the packages are getting deduplicated, not the relationships... this should be confirmed though).

After reviewing the PR, I see a few problems that seem fundamental:

  1. LocationMetadata is being altered in a way that makes it aware of the method used from the resolver perspective. Furthermore, since this is used during formatting, the format implicitly must know about the method used to collect the packages. This is less than ideal, as it leaks concerns about how to find the data into the data itself. IsSquashedLayer is less of a problem (though not ideal), but IsSquashedAllLayersResolver is the main problem here.
  2. Converting an existing syft-json SBOM back into memory would not have any information about IsSquashedLayer on the file.LocationMetadata structs. This incongruity may cause some confusion for us down the line, especially if there is downstream processing (in syft) that depends on these being accurate.
  3. Package filtering is happening during formatting. This is a big one -- it looks like only the syft-json format has this filtering implemented, and porting it to the remaining formats is not ideal. Ideally, the SBOM object itself would be consistent before any formatters are used, which hints at this kind of work being done further upstream, processing-wise, than where it is now. It might be that this could be refactored so that the filtering happens within the package cataloging task itself instead of downstream in the formatters.
  4. The package collection is aware of the file resolver method used, which is another abstraction leak -- it should only really know about packages, not how the packages were discovered. If there were a more resolver-agnostic definition that was core to what the package collection does, that would be different, but an equivalent configuration name hasn't come to mind yet.

What's the path forward from here? I think there is a chance of modifying this PR to get it to a mergeable state, but it would require looking into the following things:

  1. Probably migrate all package filter logic from the formatter to its own task (a sketch follows this list). Today we have a set of tasks that are run to coordinate how the SBOM is populated. There are some "post tasks" that always run after cataloging (such as the relationship task) and do additional work on what is already in the SBOM. It seems like we could add a new task, run just after package cataloging, that would be responsible for actively removing packages that shouldn't be in the SBOM at all (as well as their relationships). This change should be enough to remove the need for IsSquashedAllLayersResolver on the file.LocationMetadata object.
  2. Directly related to point 1: is there a way to remove the IsSquashedLayer bool on file.LocationMetadata? I don't see an obvious way to do this, but it should be addressed per fundamental problem 2. Should we expose this information in the JSON? Should it be another location annotation instead? Can we convince ourselves that this is entirely a cataloging/post-cataloging/pre-formatting concern, which would make it a non-point we don't need to solve for?
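
To illustrate point 1, here is a minimal sketch of what such a post-cataloging filtering task could look like. Every name here is a hypothetical stand-in -- syft's actual task interface and SBOM types differ -- but it shows the intended placement of the filtering and the relationship pruning:

// Hypothetical sketch of a post-cataloging task; these types are stand-ins,
// not syft's real sbom/pkg packages.
package sketch

type Location struct {
    LayerID         string
    IsSquashedLayer bool
}

type Package struct {
    ID        string
    Locations []Location
}

type Relationship struct {
    FromID, ToID string
}

type SBOM struct {
    Packages      map[string]Package
    Relationships []Relationship
}

// FilterSquashedPackages runs right after package cataloging: it drops any
// package with no squashed-layer evidence, then prunes relationships that
// reference a removed package.
func FilterSquashedPackages(s *SBOM) {
    for id, p := range s.Packages {
        if !hasSquashedLocation(p) {
            delete(s.Packages, id) // deleting during range is safe in Go
        }
    }
    kept := s.Relationships[:0]
    for _, r := range s.Relationships {
        _, fromOK := s.Packages[r.FromID]
        _, toOK := s.Packages[r.ToID]
        if fromOK && toOK {
            kept = append(kept, r)
        }
    }
    s.Relationships = kept
}

func hasSquashedLocation(p Package) bool {
    for _, loc := range p.Locations {
        if loc.IsSquashedLayer {
            return true
        }
    }
    return false
}

Running this as a task would keep every formatter agnostic of how the packages were collected, which addresses fundamental problems 1 and 3 above.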

The following changes would additionally be needed:

  1. Any []file.Location returned by the new resolver needs to be deduplicated. We have file.LocationSet, which should help with this, but note that the current implementation records squashed=true and squashed=false for potentially the same layer information, so this needs to be considered (see the sketch after this list).
  2. We need to deduplicate the relationships coming from the several duplicate file-parser calls that are happening. I think this would need to occur after we know what the duplicates are, which means there is some non-obvious interaction with the package cataloger here.
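
For change 1, here is a sketch of the deduplication behavior I'd expect -- again with hypothetical types. If the squashed flag is treated as part of a location's identity, squashed=true and squashed=false for the same path+layer would survive as two entries; merging the flag explicitly avoids that:

// Hypothetical types; the point is the merge policy, not syft's real API.
package sketch

type Location struct {
    Path            string
    LayerID         string
    IsSquashedLayer bool
}

type key struct{ path, layer string }

// dedupeLocations collapses duplicates by (path, layerID). If any duplicate
// is marked squashed, the surviving location keeps IsSquashedLayer=true, so
// resolver-origin differences don't multiply the location count.
func dedupeLocations(locs []Location) []Location {
    merged := map[key]Location{}
    var order []key
    for _, l := range locs {
        k := key{l.Path, l.LayerID}
        if existing, ok := merged[k]; ok {
            existing.IsSquashedLayer = existing.IsSquashedLayer || l.IsSquashedLayer
            merged[k] = existing
            continue
        }
        merged[k] = l
        order = append(order, k)
    }
    out := make([]Location, 0, len(order))
    for _, k := range order {
        out = append(out, merged[k])
    }
    return out
}

Whether this merge belongs in the new resolver or in file.LocationSet itself is an open question.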

@tomersein shout out if you want to sync on this explicitly -- I'd be glad to help. A good default time to chat with us is during our community office hours; our next one is this Thursday at noon ET. If that doesn't work, we can always chat through Discourse group topics or DMs to set up a separate Zoom call.

@TimBrown1611 commented

Hi @wagoodman,
Thanks for the comments. I will not be available to work on this during the coming month, but I would like to continue developing it after that.
It would be helpful to discuss all the open issues so that once I am available again I can pick it up. I will watch the discussion on YouTube, but I won't be able to join myself :)

Please let me know if that's ok!
