
Squashed all layers #3138

Open · wants to merge 6 commits into main
Conversation

@tomersein (Contributor) commented Aug 20, 2024

This PR tries to solve the squash-with-all-layers resolver issue, aligned with the newest version of syft.
Please let me know how to proceed further; the solution here may not be perfect, but it does know how to handle deleted packages.

Part of #15

@dbrugman commented

@tomersein - I know very little about the Syft internals, and I'm trying to understand this PR. From the code and comments I understand that the new option will catalog packages from all layers, but then only include packages that are visible in the squashed file-system. How is that different from the regular squashed scope (or, I could probably rephrase this to: what is the difference between 'cataloging' and 'including')?

My main concern is whether this would (eventually) help to fix issue #1818

Many thanks!

@tomersein (Contributor, Author) commented

Hi @dbrugman,
In this PR I am trying to display only packages that exist in the squashed layer and, when they do, to include all of the layers they exist in, so we can track down the layer in which they were added.

@dbrugman commented

Got it, thanks @tomersein

@kzantow (Contributor) commented Aug 29, 2024

Hi @tomersein -- thanks for the contribution. I don't think we would want to merge this as-is, though. I wonder whether there are other things we could do to help you accomplish what you're hoping to achieve.

So that I understand correctly: the use case is to be able to find the layer that introduced a package, right?

@tomersein (Contributor, Author) commented Aug 29, 2024

Yes, correct @kzantow -- let me know what the gaps are so I can push some fixes/improvements.
I want to add some more information following your meeting yesterday:

  • The advantage of this solution is that you only need to scan once. When end users want to see the vulnerabilities in their container, all-layers output can confuse them, since some of those packages no longer exist.
  • This solution can help users fix their vulnerabilities by updating the layer the vulnerability was introduced in.

@kzantow - please see my notes after the meeting yesterday.
@wagoodman I am available to do some fixes if needed, just let me know :)

@TimBrown1611 commented

Any update? :) @wagoodman

@tomersein (Contributor, Author) commented

Did some static-analysis corrections; all checks now pass.
@kzantow @wagoodman

@wagoodman (Contributor) commented

@tomersein thank you for submitting a candidate solution to the problem of tracking layer-of-first-attribution.

Let me first summarize how this PR achieves attribution. The first change adds a new file Resolver, which makes use of the squashed resolver and the all-layers resolver depending on the use case. The second change adds IsSquashedAllLayersResolver and IsSquashedLayer bools to the core LocationMetadata struct; the new file resolver raises locations that positively confirm whether a location is from the squashed layer and whether the new resolver is in use. The last change updates the syft-json formatter to filter out all packages that have no locations from the squashed layer. This, in combination with the existing deduplication logic, would yield the same number of packages found, but with additional layer attributions.
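
To make the shape of this concrete, here's a rough sketch (in Go, with names paraphrased from the PR description rather than copied from the actual code) of the mechanism as I read it:

// Illustrative sketch only -- names paraphrase the PR's described changes,
// not the exact syft source.
package sketch

// LocationMetadata carries the two new resolver-aware flags described above.
type LocationMetadata struct {
    LayerID                     string
    IsSquashedLayer             bool // visible in the squashed filesystem
    IsSquashedAllLayersResolver bool // produced by the new combined resolver
}

type Package struct {
    Name      string
    Locations []LocationMetadata
}

// filterPackages mirrors the syft-json formatter change: when the combined
// resolver was used, keep a package only if at least one of its locations is
// in the squashed layer, while retaining all of its layer attributions.
func filterPackages(pkgs []Package) []Package {
    var kept []Package
    for _, p := range pkgs {
        visible, newResolver := false, false
        for _, loc := range p.Locations {
            visible = visible || loc.IsSquashedLayer
            newResolver = newResolver || loc.IsSquashedAllLayersResolver
        }
        // only filter when the new resolver produced the locations;
        // otherwise behave as before
        if visible || !newResolver {
            kept = append(kept, p)
        }
    }
    return kept
}

The squashed-visibility check is what keeps the package count identical to a squashed scan, while the extra locations carry the layer attribution.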

Take for example a (rather silly) Dockerfile:

FROM ubuntu:latest
RUN apt update -y
RUN apt install -y jq
RUN apt install -y vim
RUN apt install -y wget curl
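
For reference, assuming the Dockerfile above is in the current directory, the image used in the examples below can be produced with something like:

docker build -t localhost/squash-all-layers .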

And after build:

$ docker inspect localhost/squash-all-layers | jq '.[].RootFS.Layers'
[
  "sha256:c26ee3582dcbad8dc56066358653080b606b054382320eb0b869a2cb4ff1b98b",
  "sha256:5ba46f5cab5074e141556c87b924bc3944507b12f3cd0f71c5b0aa3982fb3cd4",
  "sha256:1fde57bfea7ecd80e6acc2c12d90890d32b7977fec17495446678eb16604d8c7",
  "sha256:9b6721789a2d1e4cef4a8c3cc378580eb5df938b90befefba0b55e07b54f0c33",
  "sha256:4097f47ebf86f581c9adc3c46b7dc9f2a27db5c571175c066377d0cef9995756"
]

Here we'll have multiple copies of the DPKG status file, which means that classically we'll use the last layer for all package evidence locations (at least when it comes to the primary evidence location for the status file).

Let's take a look at just vim's locations:

cat /tmp/after.json | jq '.artifacts[] | select(.name == "vim").locations'
[
  {
    "path": "/usr/share/doc/vim/copyright",
    "layerID": "sha256:9b6721789a2d1e4cef4a8c3cc378580eb5df938b90befefba0b55e07b54f0c33",
    "accessPath": "/usr/share/doc/vim/copyright",
    "annotations": {
      "evidence": "supporting"
    }
  },
  {
    "path": "/var/lib/dpkg/info/vim.md5sums",
    "layerID": "sha256:9b6721789a2d1e4cef4a8c3cc378580eb5df938b90befefba0b55e07b54f0c33",
    "accessPath": "/var/lib/dpkg/info/vim.md5sums",
    "annotations": {
      "evidence": "supporting"
    }
  },
  {
    "path": "/var/lib/dpkg/status",
    "layerID": "sha256:4097f47ebf86f581c9adc3c46b7dc9f2a27db5c571175c066377d0cef9995756",
    "accessPath": "/var/lib/dpkg/status",
    "annotations": {
      "evidence": "primary"
    }
  },
  {
    "path": "/var/lib/dpkg/status",
    "layerID": "sha256:9b6721789a2d1e4cef4a8c3cc378580eb5df938b90befefba0b55e07b54f0c33",
    "accessPath": "/var/lib/dpkg/status",
    "annotations": {
      "evidence": "primary"
    }
  }
]

Note that we see the original layer the package was added in (sha256:9b6...c33) as well as the final, unrelated modification of the status file (sha256:409...756). This is great! This is exactly what we're looking for in terms of results. There might be some debate around including one and only one spot for primary evidence, but let's ignore that for now.

Here's what I see when running a before and after:

❯ syft localhost/squash-all-layers:latest -o table=/dev/null
 ✔ Loaded image                                                                                                                                                           localhost/squash-all-layers:latest
 ✔ Parsed image                                                                                                                      sha256:6a78fd79097acadb77e57cd1c32fca596c3addc3d99c77e4fc977032a2ab3eb2
 ✔ Cataloged contents                                                                                                                       6608654972fcc7d28136e3ecffca4bfe371d89f0737cef299bdf378c87146bcf
   ├── ✔ Packages                        [132 packages]
   ├── ✔ File metadata                   [5,418 locations]
   ├── ✔ Executables                     [809 executables]
   └── ✔ File digests                    [5,418 files]

❯ syft localhost/squash-all-layers:latest -s squash-with-all-layers -o table=/dev/null
 ✔ Loaded image                                                                                                                                                           localhost/squash-all-layers:latest
 ✔ Parsed image                                                                                                                      sha256:6a78fd79097acadb77e57cd1c32fca596c3addc3d99c77e4fc977032a2ab3eb2
 ✔ Cataloged contents                                                                                                                       6608654972fcc7d28136e3ecffca4bfe371d89f0737cef299bdf378c87146bcf
   ├── ✔ Packages                        [132 packages]
   ├── ✔ File metadata                   [40,226 locations]
   ├── ✔ Executables                     [1,618 executables]
   └── ✔ File digests                    [40,226 files]

It looks like ~138 packages were found during cataloging, and before finalizing the number dropped to ~132, so that's good.

But I noticed these runs took different times -- 8 seconds vs 11 seconds. Not a big difference, but given that this is a small and simple image, it is worth looking at. I believe this is because we're essentially doing both a squashed scan and an all-layers scan implicitly, since the resolver returns all references from both resolvers (and does not deduplicate file.Locations, by the way). This isn't a problem in and of itself, since any approach might need to do just this, but I think it explains the mechanics of what's happening time-wise.

Also note that there are several more executables and files cataloged! This is concerning, since this should behave no differently than the squashed cataloger from a count perspective. It's not immediately apparent what is causing this, but it is a large blocker for this change (at first glance I think it's because catalogers are creating duplicate packages and relationships, but only the packages are getting deduplicated, not the relationships... this should be confirmed though).

After reviewing the PR, I see a few problems that seem fundamental:

  1. LocationMetadata is being altered in a way that makes it aware of the method used from the resolver perspective. Furthermore, since this is used during formatting, the format implicitly must know about the method used to collect the packages. This is less than ideal, as it leaks concerns about how to find the data into the data itself. IsSquashedLayer is less of a problem (though not ideal), but IsSquashedAllLayersResolver is the main problem here.
  2. Converting an existing syft-json SBOM back into memory would not have any information about IsSquashedLayer on the file.LocationMetadata structs. This incongruity may cause some confusion for us down the line, especially if there is downstream processing (in syft) that depends on these being accurate.
  3. Package filtering is happening during formatting. This is a big one -- it looks like only the syft-json format has this filtering implemented, and porting it to the remaining formats is not ideal. Ideally, the SBOM object itself would be consistent before any formatters are used, which hints at this kind of work being done further upstream, processing-wise, than where it is now. It might be that this could be refactored so that the filtering happens within the package cataloging task itself instead of downstream in the formatters.
  4. The package collection is aware of the file resolver method used, which is another abstraction leak -- it should only really know about packages, not how the packages were discovered. If there were a more resolver-agnostic definition that was core to what the package collection does, that would be different, but an equivalent configuration name hasn't come to mind yet.

What's the path forward from here? I think there is a chance of modifying this PR to get it to a mergeable state, but it would require looking into the following things:

  1. Probably migrate all package filter logic from the formatter to its own task (a sketch follows this list). Today we have a set of tasks that are run to coordinate how the SBOM is populated. There are some "post tasks" that always run after cataloging (such as the relationship task) and do additional work on what is already in the SBOM. It seems like we could add a new task, run just after package cataloging, that would be responsible for actively removing packages that shouldn't be in the SBOM at all (as well as their relationships). This change should be enough to remove the need for IsSquashedAllLayersResolver on the file.LocationMetadata object.
  2. Directly related to point 1: is there a way to remove the IsSquashedLayer bool on file.LocationMetadata? I don't see an obvious way to do this, but it should be addressed per fundamental problem 2. Should we expose this information in the JSON? Should it be another location annotation instead? Can we convince ourselves that this is entirely a cataloging/post-cataloging/pre-formatting concern, which would make it a non-point we don't need to solve for?
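
To illustrate point 1, here is a minimal sketch of what such a post-cataloging filtering task could look like. Every name here is a hypothetical stand-in -- syft's actual task interface and SBOM types differ -- but it shows the intended placement of the filtering and the relationship pruning:

// Hypothetical sketch of a post-cataloging task; these types are stand-ins,
// not syft's real sbom/pkg packages.
package sketch

type Location struct {
    LayerID         string
    IsSquashedLayer bool
}

type Package struct {
    ID        string
    Locations []Location
}

type Relationship struct {
    FromID, ToID string
}

type SBOM struct {
    Packages      map[string]Package
    Relationships []Relationship
}

// FilterSquashedPackages runs right after package cataloging: it drops any
// package with no squashed-layer evidence, then prunes relationships that
// reference a removed package.
func FilterSquashedPackages(s *SBOM) {
    for id, p := range s.Packages {
        if !hasSquashedLocation(p) {
            delete(s.Packages, id) // deleting during range is safe in Go
        }
    }
    kept := s.Relationships[:0]
    for _, r := range s.Relationships {
        _, fromOK := s.Packages[r.FromID]
        _, toOK := s.Packages[r.ToID]
        if fromOK && toOK {
            kept = append(kept, r)
        }
    }
    s.Relationships = kept
}

func hasSquashedLocation(p Package) bool {
    for _, loc := range p.Locations {
        if loc.IsSquashedLayer {
            return true
        }
    }
    return false
}

Running this as a task would keep every formatter agnostic of how the packages were collected, which addresses fundamental problems 1 and 3 above.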

The following changes would additionally be needed:

  1. Any []file.Location returned by the new resolver needs to be deduplicated. We have file.LocationSet, which should help with this, but note that the current implementation records squashed=true and squashed=false for potentially the same layer information, so this needs to be considered (see the sketch after this list).
  2. We need to deduplicate the relationships coming from the several duplicate file-parser calls that are happening. I think this would need to occur after we know what the duplicates are, which means there is some non-obvious interaction with the package cataloger here.
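
For change 1, here is a sketch of the deduplication behavior I'd expect -- again with hypothetical types. If the squashed flag is treated as part of a location's identity, squashed=true and squashed=false for the same path+layer would survive as two entries; merging the flag explicitly avoids that:

// Hypothetical types; the point is the merge policy, not syft's real API.
package sketch

type Location struct {
    Path            string
    LayerID         string
    IsSquashedLayer bool
}

type key struct{ path, layer string }

// dedupeLocations collapses duplicates by (path, layerID). If any duplicate
// is marked squashed, the surviving location keeps IsSquashedLayer=true, so
// resolver-origin differences don't multiply the location count.
func dedupeLocations(locs []Location) []Location {
    merged := map[key]Location{}
    var order []key
    for _, l := range locs {
        k := key{l.Path, l.LayerID}
        if existing, ok := merged[k]; ok {
            existing.IsSquashedLayer = existing.IsSquashedLayer || l.IsSquashedLayer
            merged[k] = existing
            continue
        }
        merged[k] = l
        order = append(order, k)
    }
    out := make([]Location, 0, len(order))
    for _, k := range order {
        out = append(out, merged[k])
    }
    return out
}

Whether this merge belongs in the new resolver or in file.LocationSet itself is an open question.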

@tomersein shout out if you want to sync on this explicitly -- I'd be glad to help. A good default time to chat with us is during our community office hours; our next one is this Thursday at noon ET. If that doesn't work, we can always chat through Discourse group topics or DMs to set up a separate Zoom call.

@TimBrown1611 commented

Hi @wagoodman,
Thanks for the comments. I will not be available to work on this during the coming month, but I would like to continue developing it after that.
It would be helpful to discuss all the open issues so that once I am available again I can pick it up. I will watch the discussion on YouTube, but I won't be able to join myself :)

Please let me know if that's ok!
