Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Auto-generate File metadata fields #198

Open
multimeric opened this issue Sep 23, 2024 · 9 comments
Open

Auto-generate File metadata fields #198

multimeric opened this issue Sep 23, 2024 · 9 comments

Comments

@multimeric
Copy link

If you crate.add_file() a local file, I think it makes sense to populate:

This seems to be done already for remote files, so it makes sense to do the same for local ones.

Some other fields like height, width and duration could be automatically determined for images and videos respectively, but this is a much harder and less essential.

Happy to help with this!

@rsirvent
Copy link
Contributor

On contentSize I agree, it's just a system call to get a number.

However, generating the sha256 of a file can take quite some time, especially if the file is large. Imagine if you have to do the same for 10000 files... This could lead to unwanted overheads when creating an RO-Crate.

@multimeric
Copy link
Author

Maybe a boolean option to calculate the sha256, which defaults to false then?

@simleo
Copy link
Collaborator

simleo commented Sep 25, 2024

Even a system call can introduce a consistent overhead, so we'd need boolean options for both properties. However, the FileOrDir and File code are already complicated enough, and a library user that needs to set those properties can easily compute them in client code and pass them through the properties argument:

size = ...
checksum = ...
crate.add_file(source, dest, properties={
    "contentSize": size,
    "sha256": checksum
})

It's better for the library to stay lean and leave optional stuff like this to client code.

In the case of remote files, note that contentSize is only added if validate_url is True and it's copied from the Content-Length HTTP header that we already have from checking the URL.

@multimeric
Copy link
Author

So that's a no to adding these arguments? The annoying thing for end users is that we have to do the size and checksum calculation for every file, and there are likely to be many of these. So without a built-in argument, I suspect many of us will end up implementing our own wrapper functions to do this.

@simleo
Copy link
Collaborator

simleo commented Sep 25, 2024

So that's a no to adding these arguments? The annoying thing for end users is that we have to do the size and checksum calculation for every file, and there are likely to be many of these. So without a built-in argument, I suspect many of us will end up implementing our own wrapper functions to do this.

Well, those properties are not required by the RO-Crate spec at any level, so some end users might want to add them while others might not. I'd rather not add the burden of more code to maintain to the library for something that's optional. I'll leave this open for others to chime in.

@simleo
Copy link
Collaborator

simleo commented Sep 26, 2024

Other things to consider:

  • sha256 is not in the RO-Crate 1.1 context (but it's present in the forthcoming 1.2)
  • One might want to use other checksum algorithms (e.g., sha512). These are also not in the RO-Crate 1.1 context (but several are present in the workflow-run context)

@kinow
Copy link
Member

kinow commented Sep 26, 2024

I think having flags to turning these fields (and others, width/height/etc.) would be the way to go.

On having ro-crate-py doing it, or staying lean and asking clients to implement this, maybe there could be other options too. Like having plugins in ro-crate-py, like ro-crate-py-fileutils or so. When installed, then that brings code that tries to populate file information, mime type, whatnot.

This way ro-crate-py focuses only on RO-Crate and Python, and anything more specific but that helps users/implementation-devs would go to these plug-ins, and implementations can decide to use it or not.

@simleo
Copy link
Collaborator

simleo commented Sep 30, 2024

Discussed with @stain last Thursday at the Workflow Run RO-Crate meeting. We decided to add contentSize: while it's not mentioned explicitly in the spec, the property shows up in many examples, so we can consider it kind of a recommendation. The implementation is in #201. Notes:

  • The record_size flag is in add_file and is propagated to FileOrDir.__init__, but the contentSize property is actually added in File.write when the file is written to disk. If we added it when the file entity is created, it could get invalidated multiple times (e.g. if data is appended to the file before the crate is written).
  • I ran a quick performance test that involved creating an RO-Crate with 10K files: the process is about 15% slower when contentSize is added to each File entity. For this reason I introduced the record_size flag and set its default to False.

Regarding the recording of a checksum such as sha256, we observed that:

  • it's not mentioned in the RO-Crate spec at all
  • there are too many types, and the terms are not in the RO-Crate context
  • users might want to record more than one checksum, e.g., sha256 and sha512.

So this is better left to user-level code.

@kinow
Copy link
Member

kinow commented Sep 30, 2024

A fair approach! Thanks for summarizing it here, @simleo !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants