Advanced resource entity implementation

Jurrian Tromp edited this page Dec 6, 2019 · 2 revisions

Elaborated on #184:

Summary
This page describes resource provenance using two attributes that relate the ORI resource to the original resource, so that the relation can be used as metadata.

  • Internally we use canonical_id and canonical_iri, which can be serialized and published in a different form.
  • entity has been replaced by canonical_id and canonical_iri; see history.
  • Both canonical_id and canonical_iri may be specified in the same resource.
  • canonical_id can appear in the used_file to designate a subsection if that file contains multiple nested resources.
  • canonical_iri should designate as closely as possible which resource was used. In the simplest case this is the URL of the resource that was retrieved. However, if that URL contains multiple resources, we can 'guess' what the direct URL of the resource would be.
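
As a rough illustration of the attributes listed above, the sketch below models them as a small Python data class. The class name `Provenance`, the `to_dict` helper, and the example values are assumptions made for this illustration; they are not taken from the ORI code base.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Provenance:
    """Hypothetical container for the provenance attributes of one resource."""

    canonical_id: Optional[str] = None   # supplier identifier, scoped to used_file
    canonical_iri: Optional[str] = None  # IRI as close as possible to the source
    used_file: Optional[str] = None      # URL of our cached copy of the retrieved file

    def to_dict(self) -> Dict[str, str]:
        """Return only the attributes that are set, ready for serialization."""
        fields = {
            "canonical_id": self.canonical_id,
            "canonical_iri": self.canonical_iri,
            "used_file": self.used_file,
        }
        return {key: value for key, value in fields.items() if value is not None}


# Both canonical attributes may be present on the same resource.
# The identifier and URL below are illustrative values only.
example = Provenance(
    canonical_id="780972",
    canonical_iri="https://api.notubiz.nl/document/780972",
)
```

Leaving unset attributes out of `to_dict` keeps the serialized form free of empty values; how the attributes are actually named when made public is not covered by this sketch.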

Description
Different resources can be derived from one entity, e.g. a meeting has multiple nested documents. These documents may be resolvable by their own URL, but the original source of each resource is still the same as that of its parent, since the parent is our actual source. (Resources that can be resolved, i.e. that have their own identifier, should use that URL instead; see the comment below.)

If possible, the resource should be identified with a URL, including scheme and query parameters, like this. It should represent the supplier's resource as the supplier specifies it: it should include a scheme (https:// by default) but no additional parameters. If the supplier does not specify a URL but we can assume the resource exists, we can construct the more specific URL ourselves. This makes these identifiers IRIs, which are most often URLs, and it implies that we cannot assume they always resolve.
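
The default scheme mentioned above could be applied with a small helper like the following. This is a minimal sketch: the function name `with_default_scheme` and the regular expression are assumptions for illustration, and the helper does not check whether the resulting IRI resolves.

```python
import re

# Matches a leading scheme such as "https://" or "ftp://".
_SCHEME = re.compile(r"^[A-Za-z][A-Za-z0-9+.\-]*://")


def with_default_scheme(iri: str, default: str = "https") -> str:
    """Prefix `iri` with `default://` when the supplier omitted the scheme."""
    iri = iri.strip()
    if _SCHEME.match(iri):
        # A scheme is already present, so keep the supplier's value as-is.
        return iri
    if iri.startswith("//"):
        # Protocol-relative form: only the scheme name is missing.
        return f"{default}:{iri}"
    return f"{default}://{iri}"
```

For example, `with_default_scheme("api.notubiz.nl/document/780972")` yields an `https://` URL, while a value that already carries a scheme is returned unchanged.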

The canonical creates the bridge between the mapping IRI and the supplier's resource. In SOAP it is not possible to use URLs to identify a specific resource. In that case we have no more information than the identifier itself, so we use canonical_id, which would be something like '8984124'. The used_file would be the URL to our cached version of the SOAP response. In a later iteration we can use URL fragments to designate the identifier within the context of the cached version (this proves to be a problem with Google Storage document revision). We use separate canonical_id and canonical_iri fields because we need to serialize them as different attributes.
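
The choice between the two fields can be sketched as follows. The function name `canonical_fields`, its parameters, the rule that at least one canonical value must be given, and the cache URL are all assumptions made for this illustration.

```python
from typing import Dict, Optional


def canonical_fields(
    used_file: str,
    resource_url: Optional[str] = None,
    supplier_id: Optional[str] = None,
) -> Dict[str, str]:
    """Build the provenance attributes for a resource taken from a cached file."""
    if not resource_url and not supplier_id:
        # Assumption of this sketch: a resource needs at least one canonical value.
        raise ValueError("either resource_url or supplier_id is required")

    fields = {"used_file": used_file}
    if resource_url:
        # The supplier exposes the resource at its own URL.
        fields["canonical_iri"] = resource_url
    if supplier_id:
        # No URL may be available (e.g. SOAP); the bare identifier is only
        # meaningful within the scope of used_file.
        fields["canonical_id"] = supplier_id
    return fields


# Hypothetical cache location; the real URL layout is not specified here.
soap_item = canonical_fields(
    used_file="https://cache.example.org/soap/response.xml",
    supplier_id="8984124",
)
```

Keeping the two values in separate keys mirrors the need to serialize them as different attributes, and allows both to be present on the same resource.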

Some considerations:

  • When a subresource has its own URL, canonical_iri is used to specify it. There is no direct relation between canonical_iri and used_file: the canonical refers to the specific resource, while the used_file should be the cached version of the resource's parent.
  • When a subresource doesn't have its own URL, canonical_id is used to designate the subresource within the resource. There is a direct relation between canonical_id and used_file, since the id will always be in the scope of the cached file.
  • A downloadable document has a schema:contentUrl to the resolver, so used_file shouldn't refer to the same cache URL. Instead, it should refer to the file where the URL to the document was originally specified. Also, schema:isBasedOn, set by the enricher, refers to the document's original download URL. The canonical should refer to the same URL, except when the following applies:
  • Some suppliers distinguish between a document resource URL and a document download URL. If this is the case, canonical_iri should be the resource URL and schema:isBasedOn should be the download URL.
  • Note that for canonical_iri the document resource URL is specified here as "self": "api.notubiz.nl/document/780972", without the ?format=json&version=1.10.8. However, we cannot add this information; it is up to the user to decide which version and format to use. If possible, we want to give the user as much information as we can about how to find the actual resource we used, so it will include at least the version query parameter, and it would be wise to include format as well. Sensitive query parameters such as authentication should be left out.
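
Leaving out sensitive query parameters while keeping the rest, as the last consideration suggests, could look roughly like this. The function name `public_canonical_iri` and the parameter names in `SENSITIVE_PARAMS` are assumptions; the actual names differ per supplier.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed names of sensitive parameters; adjust per supplier.
SENSITIVE_PARAMS = {"token", "api_key", "apikey", "access_token", "password"}


def public_canonical_iri(url: str) -> str:
    """Return `url` without sensitive query parameters, keeping the others."""
    parts = urlsplit(url)
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name.lower() not in SENSITIVE_PARAMS
    ]
    # Rebuild the URL with only the parameters that are safe to publish.
    return urlunsplit(parts._replace(query=urlencode(kept)))
```

With this sketch, parameters such as `format` and `version` are preserved in their original order, and a URL without a query string is returned unchanged.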