Attempt to implement alternative proposal for namespaceMap #491

Merged
merged 5 commits into from
Sep 20, 2023

Conversation

goneall
Member

@goneall goneall commented Sep 5, 2023

This is an attempt to implement the proposal documented in issue #489 as a pull request on top of pull request #490

I took the approach of a pull request on top of the alternative pull request since the structure is basically the same. It is just the descriptions which are different.

Note that the basic structure is also the same as pull request #403 - the main difference being the names of the properties and classes.

@goneall
Member Author

goneall commented Sep 5, 2023

@maxhbr - I tried to implement your proposal as a pull request.

This is based on a branch in the github repo - feel free to update the branch to better represent your proposal before the tech call.

@davaya
Contributor

davaya commented Sep 5, 2023

Re: X-Collection.md:

Information we wish to preserve about the serialization itself are store as properties of this class.

Information we wish to preserve about the serialized data itself are stored as properties of this class.

NOTE: This class is not intended to be serialized itself.

The use case for ExternalMap still needs to be considered. It must be removed from ElementCollection if we want to avoid locking SBOMs to a single serialized data instance.

That means it must be put somewhere else, and X-Collection is a candidate for holding it. In that case, X-Collection instances would be serialized by ConsumerProducers who reference external serialized data from their serialized data.

@maxhbr
Member

maxhbr commented Sep 5, 2023

I started it at some point in #479

must be represented in that format "native" to the serialization.
The NamespaceMap itself will never be serialized as part of SPDX data if the serialization format support namespaces or prefixes.
If the serialization format does not support prefixes, then the full URI's for the elements must be used and the namespace map will not be preserved.
Any custom serialization format SHOULD implement namespaces in order to preserve the namespace map.
Member

I think this part should be described in X-Collection.md. The X-Collection should contain a description of how it is expected to be created when deserializing a blob.

Contributor

@davaya davaya Sep 5, 2023

  • The NamespaceMap itself will never be serialized as part of SPDX data if the serialization format support namespaces or prefixes.
  • If the serialization format does not support prefixes, then the full URI's for the elements must be used and the namespace map will not be preserved.

The first requirement can be terminated at "The NamespaceMap itself will never be serialized in an X-collection element." The "if the serialization format supports prefixes" part is rendered moot by the second requirement.

The X element created by a Consumer upon parsing a payload A may be serialized to allow Consumer's payload B to reference elements in payload A instead of serializing copies of them from Consumer's model store.

Signed-off-by: Gary O'Neall <[email protected]>
@davaya
Contributor

davaya commented Sep 7, 2023

September 6 - What we agree on:

  • A subset of the use cases (some new ones were added in PRs which haven't been discussed)
  • NamespaceMap will be in the model
    • Namespaces are portions of Element ID IRIs
  • The model will always use the full IRI in every element
  • NamespaceMaps are specific to supporting serialization (check)
  • All model proposals have an "X-Collection" as an outer shell
  • NamespaceMaps are only useful in collections
  • NamespaceMaps can only be used on collections
  • Serialized data is composed of one or more element values that may or may not be included in an ElementCollection
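
As a rough illustration of the agreed points above (the model always holds full IRIs, and a NamespaceMap only matters when serializing), here is a minimal Python sketch. The function names `expand` and `compact`, the `prefix:local` syntax, and the example namespace are assumptions made for illustration; they are not taken from the SPDX specification or from this PR.

```python
def expand(compact_id: str, namespace_map: dict[str, str]) -> str:
    """Return the full IRI for a 'prefix:local' ID; other IDs pass through unchanged."""
    prefix, sep, local = compact_id.partition(":")
    if sep and prefix in namespace_map:
        return namespace_map[prefix] + local
    return compact_id


def compact(full_iri: str, namespace_map: dict[str, str]) -> str:
    """Return 'prefix:local' using the longest matching namespace, else the full IRI."""
    matches = [(p, ns) for p, ns in namespace_map.items() if full_iri.startswith(ns)]
    if not matches:
        return full_iri
    prefix, namespace = max(matches, key=lambda m: len(m[1]))
    return f"{prefix}:{full_iri[len(namespace):]}"


# Hypothetical namespace map; the model store itself would only ever hold full IRIs.
namespace_map = {"ex": "https://example.org/spdx/"}
model_id = "https://example.org/spdx/pkg1"

serialized_id = compact(model_id, namespace_map)      # applied only when writing
restored_id = expand(serialized_id, namespace_map)    # applied only when parsing
```

An ID that matches no namespace is written out as its full IRI, which is also what a format without prefix support would do for every element.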

Bob articulated the distinction between capturing prefixes in model data after serialization or before:

  1. Prefixes can be tied to a specific serialization instance "payload"
  2. Prefixes can be tied to a specific set of elements that are not (yet) serialized
  3. Not tied to any set of elements - just part of the model <- out of scope for current SPDX release

The identical model can support both 1 and 2:

  1. Prefixes are serialized using format-specific syntax; serialized data instances are described by SpdxDocument elements
  2. Prefixes are defined in model data as a property of ElementCollection (Bundle, Sbom, etc) and SpdxDocument elements.

The distinction is operational. In case (2), which generates an outer X element, those elements continue to accumulate in the model data as collections use other collections: one producer creates Sbom1 with a namespaceMap, another producer creates Sbom2 with a namespaceMap, and a third producer creates a Bundle / Sbom3 that includes Sbom1 and Sbom2 with a third namespaceMap. As the model graph continues to grow, every previously outer collection becomes an inner one when it is added to a new outer collection.

In case (1) SpdxDocument elements are created only as needed when Sbom3 references a document containing Sbom1. Sbom1 doesn't have to contain namespaceMap at all, so nested namespaceMap data doesn't build up in the model graph as reference chains grow in length.

In the harmonized model namespaceMap is an optional property of ElementCollection and an optional property of SpdxDocument, and if present their instances are hints to be considered when serializing documents and creating SpdxDocument elements. The term "X" isn't a new box in the model, it is a reference to either SpdxDocument (operational case 1) or ElementCollection (operational case 2).

If namespaceMap isn't populated in any element, round-tripping still works across all serialization formats that support it, and across all serialization formats, period, if every format is required to define its representation.
Persistence is implemented at the serialized data layer without involving the logical model at all, and all of this discussion becomes unnecessary.

I suspect that pre-serialization NamespaceMaps (case 2) will lean toward case 3 - short prefix strings mapped to short URIs that require long local names, while post-serialization NamespaceMaps for case 1 will use short prefix strings mapped to long URIs that allow short local names within the serialized document, and are thus significantly more effective at shortening those documents.
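
To make the size argument above concrete, here is a small worked calculation in Python. The element IDs, the prefix `ex`, and both namespaces are invented for illustration, and the count ignores the one-time cost of declaring the prefix in the serialized document.

```python
def serialized_length(ids, prefix, namespace):
    """Total characters needed when IDs under `namespace` are written as 'prefix:local'."""
    return sum(
        len(prefix) + 1 + len(i) - len(namespace) if i.startswith(namespace) else len(i)
        for i in ids
    )


# Hypothetical element IDs sharing one long common stem.
ELEMENT_IDS = [
    f"https://example.org/spdx/acme/product-1.0/sbom#pkg{n}" for n in range(3)
]

uncompacted = sum(len(i) for i in ELEMENT_IDS)

# Short namespace URI: the local names stay long.
short_ns = serialized_length(ELEMENT_IDS, "ex", "https://example.org/")

# Long namespace URI: the local names shrink to 'pkg0', 'pkg1', ...
long_ns = serialized_length(
    ELEMENT_IDS, "ex", "https://example.org/spdx/acme/product-1.0/sbom#"
)
```

With these made-up inputs, `long_ns` is smaller than `short_ns`, which is smaller than `uncompacted`, matching the expectation that long namespace URIs are more effective at shortening a document.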

@davaya
Contributor

davaya commented Sep 7, 2023

Payload: Sean objects to the term payload because of the implication that it means a file. But we have been careful to articulate that payload includes all methods of transferring data including streaming sessions and online data stores.

Nodes in the element graph are immutable, but nodes will be added over time. If "serialization" is used to refer to database storage, then a serialized data unit is a specific storage transaction, and prefixes can be used to reduce stored data and/or transmission size just as they reduce the size of data serialized into files. The write transaction entry is the database equivalent of a filesystem file entry or inode, and the bytes returned by reading the database transaction are equivalent to the bytes returned by reading a filesystem file. The namespaceMap used in a write transaction is returned when reading it.
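
One way to picture the database analogy is an append-only log in which each write transaction stores the serialized bytes together with the namespaceMap that was used. The sketch below is only an assumption about how such a store could look; `TransactionStore` and its methods are hypothetical names, not part of any SPDX tooling.

```python
class TransactionStore:
    """Append-only log: each entry is one immutable write transaction."""

    def __init__(self):
        self._log = []  # entries are never modified or removed

    def write(self, data: bytes, namespace_map: dict) -> int:
        """Record the bytes and the namespaceMap used; return the transaction ID."""
        self._log.append((bytes(data), dict(namespace_map)))
        return len(self._log) - 1

    def read(self, tx_id: int):
        """Return the exact bytes and a copy of the namespaceMap of that transaction."""
        data, namespace_map = self._log[tx_id]
        return data, dict(namespace_map)


store = TransactionStore()
tx = store.write(b'{"@id": "ex:pkg1"}', {"ex": "https://example.org/spdx/"})
data, namespace_map = store.read(tx)
```

Reading a transaction returns the same bytes and the same map every time, which is the property the comment above asks of a database that is to hold an immutable element graph.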

If the database software does not support immutable state as of a specific transaction, then it cannot implement an immutable element graph, with or without prefixed IRI compaction.

@maxhbr
Member

maxhbr commented Sep 7, 2023

... then it cannot implement an immutable element graph ...

I do not understand the issue. Even in a live, changing DB every element is immutable, and therefore every Collection scopes out an immutable sub-graph within the DB. Isn't that sufficient?

@davaya
Contributor

davaya commented Sep 7, 2023

even in a moving and alive DB every element is immutable ...

The definition of Payload is trivial:

  1. It is a sequence of bytes - it has a byte count and a value, and two instances are equal if and only if every corresponding byte is equal.
  2. Parsing the sequence yields a set of SPDX element values.
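
The two-part definition above can be sketched in a few lines of Python. The class name `Payload` follows the discussion, but the wire format (a JSON array of element IDs) is purely an assumption chosen to keep the example short.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Payload:
    """A byte sequence; dataclass equality compares the bytes field by value."""

    data: bytes

    @property
    def byte_count(self) -> int:
        return len(self.data)

    def parse(self) -> frozenset[str]:
        """Parse the bytes (assumed: a JSON array of element IDs) into a set of elements."""
        return frozenset(json.loads(self.data.decode("utf-8")))


first = Payload(b'["urn:a","urn:b"]')
second = Payload(b'["urn:b","urn:a"]')

same_bytes = first == Payload(b'["urn:a","urn:b"]')   # identical byte sequences
different_bytes = first != second                     # same elements, different bytes
same_elements = first.parse() == second.parse()
```

Two payloads can parse to the same set of elements and still be different payloads, because payload identity is defined on the bytes and not on the parsed content.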

Sean objects to the terms Payload and Unit of Transfer for some reason that he'll have to explain, but "Byte Sequence" doesn't require anything to be "transferred", so whatever use cases he has in mind (databases were mentioned) are addressed by a sequence of bytes, and whatever name he wants to put on the box called "Payload" that means "Serialized Data" is fine with me.


But to consider databases, there is a difference between OLTP and OLAP. Transaction processing databases optimize for write speed and modify existing data. Analytic databases are optimized for read speed, but I don't know if they can guarantee WORM behavior.

Does a specific database support creation of an "immutable" (reproducible) subgraph by parsing serialized data? And does the database support metadata such as NamespaceMap that can be read by applications separately from graph content? If so, then the database supports "Payloads".

@goneall
Member Author

goneall commented Sep 19, 2023

This is a reply to @davaya comment in issue #478 - putting it here since it is more related to the namespaceMap discussion than the "where do we start" issue.

So if namespaceMap can be serialized using JSON-LD context, can't rootElement also be serialized that way, without adding an artificial wrapper element to the graph?

I don't know of anything in the native JSON-LD serialization format that can serve as the rootElement.

If we think of the "X-Collection" as the creator's expression of what a serialized blob of data is about, it doesn't feel as artificial. The challenge only comes in when the "X-Collection" conflicts with something supported by the native serialization - such as prefixes we use in the namespaceMap.

What if we think about it this way:

  • The "X-Collection" is a class which captures important information about a serialized blob of SPDX data
  • When the creator of an "X-Collection" serializes the "X-Collection" itself, any fields which are already defined in the serialization format being used MUST use the native serialization of that data. For example, if a serialization format has a mapping of prefixes, that native serialization mapping of prefixes MUST be used and the "X-Collection" MUST NOT contain any (potentially conflicting and redundant) prefix definitions.
  • In the SPDX specification, we document for each supported serialization format which "X-Collection" properties must use the native format. For example, we would have a JSON-LD section that would state the namespaceMap for the "X-Collection" must use the context prefixes. If there are other "X-Collection" fields (such as rootElement) that have a native serialization equivalent, we would document that usage as well. In other words, what we came up with for "Solution-B" would apply to any properties of the "X-Collection" that would potentially conflict.
  • Any "X-Collection" properties not conflicting with the native serialization formats would be serialized as properties of the "X-Collection" just like any other class.
  • The "X-Collection" would be "minted" by the creator of the serialized blob of SPDX data and be given an ID which is unique and represents all the associated properties of the "X-Collection" whether they were serialized in the "X-Collection" or if it used the native serialization as documented in the serialization portion of the SPDX spec.
  • When deserializing the "X-Collection", the same serialization rules apply to how the "X-Collection" is reconstituted - for example, you would construct the namespace map from the JSON-LD context.
  • If you reserialize the same "X-Collection" in a different format, you may end up serializing a different set of properties. Since we are not changing the "X-Collection" we would use the same "X-Collection" ID. We don't need to mint a different ID since it really is the same object, just serialized in a different way. This solves @maxhbr's concern about having to "re-mint" the "X-Collection".
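
The rules in the list above could be sketched roughly as follows. Everything here is hypothetical: the `XCollection` class, its field names, and the `to_jsonld` / `from_jsonld` / `to_plain` helpers are illustrations, not an SPDX API. The sketch assumes that in JSON-LD the namespace map lives in `@context`, that `rootElement` has no native equivalent and stays an ordinary property, and that a format without native prefixes would carry the map as an ordinary property.

```python
from dataclasses import dataclass, field


@dataclass
class XCollection:
    id: str
    root_element: str
    namespace_map: dict[str, str] = field(default_factory=dict)


def to_jsonld(x: XCollection) -> dict:
    """Prefixes go into the native @context, never into an X-Collection property."""
    return {
        "@context": dict(x.namespace_map),
        "@id": x.id,
        "rootElement": x.root_element,
    }


def from_jsonld(doc: dict) -> XCollection:
    """Reconstitute the namespace map from the JSON-LD context."""
    return XCollection(
        id=doc["@id"],
        root_element=doc["rootElement"],
        namespace_map=dict(doc.get("@context", {})),
    )


def to_plain(x: XCollection) -> dict:
    """Assumed format without native prefixes: the map is an ordinary property."""
    return {
        "id": x.id,
        "rootElement": x.root_element,
        "namespaceMap": dict(x.namespace_map),
    }


original = XCollection(
    id="https://example.org/spdx/doc1",
    root_element="https://example.org/spdx/sbom1",
    namespace_map={"ex": "https://example.org/spdx/"},
)
restored = from_jsonld(to_jsonld(original))
```

Both serializations carry the same ID, since the object is unchanged and only the set of properties written out explicitly differs by format.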

@goneall goneall merged commit a4351ca into namespacemap_reconstitution Sep 20, 2023
1 check passed
@davaya
Contributor

davaya commented Sep 20, 2023

@goneall: I like Bob Martin's terminology: Solution A defines a SerializableCollection, meaning a specific set of elements before that exact set is serialized. Following that example, Solution B's metadata about a serialized instance would be SerializedElements. Obviously we want to put everything that applies to all formats in SerializableCollection, this would include "where do we start", but would exclude everything that applies only to one serialized instance of that collection, such as locationHint and verifiedBy.

But SerializableCollection is not a special X-Collection or Y-Collection created just for JSON-LD, it should be a normal part of the model. It is an Sbom element if the collection is just an Sbom, or an optional* Bundle if there is a set of Sboms, Actors or Licenses. Since the wrapper X-Collection arose in response to problems with NamespaceMap in JSON-LD, I still suggest removing NamespaceMap from every element (no property in the model has NamespaceMap as its Range) and every serialization spec is required to support it. This supports every use case we identified for NamespaceMap including the lowest priority: provenance. There is no "hop to hop" or "latest received" namespace map, there is always just the map used by the producer of every serialized SPDX data instance, and that map can be re-applied to every other serialized format of that exact set of elements.

* The serialized instance doesn't need to have a Bundle, or a Bundle wrapped in an X-Collection, it can be just a bunch of Sboms or Licenses with no wrapper at all if there's no reason to permanently name in the Element Store a set of 5 elements as being different from any other set of elements except that they were included in a specific serialized instance described by SpdxElements. If there is a reason to name that set of 5 (such as to designate the root Sbom in a tree of included Sboms), then it can be a SerializableCollection Bundle, but still doesn't need an outer X-Collection wrapping the Bundle.

The Playground derived from the Model notes use cases include simple examples such as:

  • Sbom with two files
  • two Persons
  • Bundle of two Persons

The use cases can be extended to all of the flows/issues discussed here, such as:

  • Bundle of three Sboms designating where to start
  • Two Bundles of Sbom subtrees designating the root of each, plus some unrelated Persons

By looking at serialized instances of these use cases in all formats, we can understand the difference between SerializedElements describing serialized data containing two Person elements vs. two Person elements plus a Bundle element, and a SerializableCollection/Bundle of three Sboms designating where to start.

@goneall
Member Author

goneall commented Sep 20, 2023

@davaya - Let's revisit this after deciding how we handle issue #478 - how we handle "where we start" may influence how we treat this solution

@davaya
Contributor

davaya commented Sep 20, 2023

@goneall - You're the facilitator, but I don't think separate issues (like NamespaceMap, or Where we start) can be decided independently - the solutions interact holistically. Bob provisionally accepted Solution B as a way to make progress toward RC2, and Solution A as refined by Bob to a specific collection of elements, not "intent" applicable to future collections of elements by the same or different producers, was a critical breakthrough in discovering how to apply Solution A to #478.

@davaya
Contributor

davaya commented Oct 17, 2023

PR #500 includes support for both where to start (rootElement) and moving from element-level to document-level dataLicense.

@bact bact deleted the namespacemap-not-serialized branch August 28, 2024 11:12
4 participants