
[Draft] Non-kerchunk backend for HDF5/netcdf4 files. #87

Draft · wants to merge 87 commits into main

Conversation

@sharkinsspatial (Collaborator) commented Apr 22, 2024

This is a rudimentary initial implementation for #78. The core code is ported directly from kerchunk's hdf backend. I have not ported the bulk of the kerchunk backend's specialized encoding translation logic but I'll try to do so incrementally so that we can build complete test coverage for the many edge cases it currently covers.

@sharkinsspatial marked this pull request as draft April 22, 2024 18:37
@TomNicholas (Member) left a comment

This is looking great so far @sharkinsspatial!

kerchunk backend's specialized encoding translation logic

This part I would really like to either factor out, or at least really understand what it's doing. See #68

@@ -0,0 +1,206 @@
from typing import List, Mapping, Optional

import fsspec
Member

Does one need fsspec if reading a local file? Is there any other way to read from S3 without fsspec at all?

Collaborator

Not with a filesystem-like API. You would have to use boto3 or aiobotocore directly.

This is one of the great virtues of fsspec and is not to be under-valued.
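For a concrete sense of the tradeoff, here is a minimal sketch (not from this PR; the bucket and key are made up) of reading a few header bytes both ways:

# Minimal sketch, not from this PR; "some-bucket/data.nc" is a made-up object.
# With fsspec (plus s3fs), local paths and S3 URLs share one file-like API:
import fsspec

with fsspec.open("s3://some-bucket/data.nc", mode="rb") as f:
    header = f.read(8)

# Without fsspec, you drop down to boto3 and lose the file-like abstraction:
import boto3

s3 = boto3.client("s3")
resp = s3.get_object(Bucket="some-bucket", Key="data.nc", Range="bytes=0-7")
header = resp["Body"].read()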

Comment on lines 188 to 191
def virtual_vars_from_hdf(
    path: str,
    drop_variables: Optional[List[str]] = None,
) -> Mapping[str, xr.Variable]:
Member

I like this as a way to interface with the code in open_virtual_dataset.
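For illustration, a hypothetical call of this interface (the module path and file path are guesses, not taken from this PR):

import xarray as xr
from virtualizarr.readers.hdf import virtual_vars_from_hdf  # module path is a guess

virtual_vars = virtual_vars_from_hdf(
    path="s3://some-bucket/file.nc",  # illustrative path
    drop_variables=["unwanted_var"],
)
ds = xr.Dataset(virtual_vars)  # variables wrap chunk manifests, not loaded data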

@rabernat (Collaborator) commented

This looks cool @sharkinsspatial!

My opinion is that it doesn't make sense to just forklift the kerchunk code into virtualizarr. What I would love to see is an extremely tight, strictly typed, unit-tested total refactor of the parsing logic. I think you're headed down the right path, but I encourage you to push as far as you can in that direction.

@TomNicholas added the "enhancement" (New feature or request) and "references generation" (Reading byte ranges from archival files) labels Apr 22, 2024
@sharkinsspatial (Collaborator, Author) commented

@rabernat Fully agree with your take above 👆 👍 . I'm trying to work through this incrementally whenever I can find some spare time. In the spirit of thorough test coverage 🎊, I was looking through your issue pydata/xarray#7388 and the corresponding PR, but I'm not sure what the proper incantation of variable encoding configuration is to use blosc with the netcdf4 engine. Do you have an example you can provide?
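For later readers, one possible incantation goes through netCDF4-python (>= 1.6.0) directly rather than through xarray's encoding dict; this is an untested sketch, assuming a libnetcdf build with the blosc filter available:

import netCDF4
import numpy as np

# Untested sketch: netCDF4-python >= 1.6.0 exposes the new compression filters
# directly on createVariable; blosc also requires libnetcdf built with the
# blosc plugin.
nc = netCDF4.Dataset("blosc_test.nc", mode="w")
nc.createDimension("x", 100)
var = nc.createVariable("data", "f4", ("x",), compression="blosc_lz", blosc_shuffle=1)
var[:] = np.arange(100, dtype="f4")
nc.close()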

@TomNicholas mentioned this pull request May 14, 2024
@sharkinsspatial (Collaborator, Author) commented

@TomNicholas I'll try to get #261 incorporated in this branch this week. Before tackling that, I'm still dealing with this failing test. I'm still a bit unclear on the mechanics here, but I think I understand the cause. I was previously ignoring empty HDF datasets (the same behavior as kerchunk's reader), but to pass the tests introduced in #205 I added empty variable support. As a side effect, this causes a roundtripping failure, since these empty variables now get treated as coordinates.

This is related to @keewis's recent #260, so I have two questions:

  1. Is this the correct way to represent a variable for an empty HDF5 dataset, or should I consider some other sort of "empty" ManifestArray for the data?
  2. As I migrate this to align with #261 (Split kerchunk reader up), what approach should I use so that these variables are not treated as coordinates and we can roundtrip successfully?

@TomNicholas (Member) commented

Is this the correct way to represent a variable for an empty HDF5 dataset

I actually don't know, but I would find out by creating an HDF file with an empty variable, opening it with xarray, and seeing if that result is represented the same way as what you have done here.

However, if "empty variable" should just mean "a zarr array with no chunks, just the default fill_value", then I would say you should instead use the empty-manifest approach that @keewis used in #260.

Ultimately what matters is that opening HDF via xarray and opening virtual zarr that points to HDF via xarray give the same result.
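A minimal sketch of that experiment (file and variable names are illustrative):

import h5py
import xarray as xr

# Create an HDF5 file containing a zero-length dataset.
with h5py.File("empty.h5", mode="w") as f:
    f.create_dataset("empty_var", shape=(0,), dtype="f4")

# Open it with xarray and inspect how the empty variable is represented.
ds = xr.open_dataset("empty.h5", engine="h5netcdf", phony_dims="sort")
print(ds["empty_var"])  # compare this repr against the virtual version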


As I migrate this to align with #261, what approach should I use so that these variables are not treated as coordinates and we can roundtrip successfully?

VirtualiZarr has full control over which variables we choose to make coordinates. I've tried to isolate that logic in separate_coords, where coord_names are the names of variables explicitly labeled as COORDINATES in the files' variable-level metadata. Does that help?
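Not the actual implementation, but a sketch of the rule separate_coords applies (names here are illustrative):

from typing import Iterable, Mapping, Tuple

import xarray as xr

def separate_coords_sketch(
    vars: Mapping[str, xr.Variable],
    coord_names: Iterable[str],
) -> Tuple[dict, dict]:
    # A variable becomes a coordinate if it is explicitly listed in the
    # file's COORDINATES metadata, or if it is a 1-D variable named after
    # its own dimension (i.e. a dimension coordinate).
    coords, data_vars = {}, {}
    for name, var in vars.items():
        if name in coord_names or (var.ndim == 1 and var.dims[0] == name):
            coords[name] = var
        else:
            data_vars[name] = var
    return data_vars, coords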

these empty variables now get treated as coordinates.

You mean that because they are not ManifestArrays, they get treated as coordinates if they are 1D?

FYI, that re-implementation of coordinate determination is the reason there is a bug causing an inconsistency with xarray's behaviour; see #224.

Labels: encoding · enhancement (New feature or request) · references generation (Reading byte ranges from archival files)