Split kerchunk reader up #261

Merged: 29 commits merged into main from split_kerchunk_reader on Oct 19, 2024
Conversation

TomNicholas (Member) commented Oct 17, 2024

Instead of treating all uses of kerchunk references as one "reader", this PR splits them up so that each separate filetype is considered one "reader", even if many of those readers call kerchunk code.

This means every "reader" is now a separate definition of an open_virtual_dataset function, which the top-level open_virtual_dataset picks between with one big match statement. Generalizing that match to be pluggable by third-party libraries would close #245.
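
For illustration, here is a minimal sketch of that dispatch shape (the FileType members and reader functions below are placeholders for this example, not virtualizarr's actual API):

```python
# Minimal sketch of per-filetype dispatch via a match statement.
# FileType members and the reader functions are illustrative placeholders.
from enum import Enum, auto


class FileType(Enum):
    netcdf3 = auto()
    hdf5 = auto()


def _open_netcdf3_virtual(filepath: str) -> str:
    # stand-in for a reader that builds a virtual dataset of chunk manifests
    return f"virtual dataset for netcdf3 file {filepath}"


def _open_hdf5_virtual(filepath: str) -> str:
    return f"virtual dataset for hdf5 file {filepath}"


def open_virtual_dataset(filepath: str, filetype: FileType) -> str:
    # the top-level function only dispatches; each reader owns its own logic
    match filetype:
        case FileType.netcdf3:
            return _open_netcdf3_virtual(filepath)
        case FileType.hdf5:
            return _open_hdf5_virtual(filepath)
        case _:
            raise NotImplementedError(f"no virtual backend for {filetype}")
```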

This also makes the structure of our dependence on kerchunk much clearer - @mpiannucci it should now be a lot easier for you to try out the kerchunk-to-icechunk use case mentioned in #258 (comment).

cc @norlandrhagen @sharkinsspatial @ghidalgo3

TomNicholas added the Kerchunk (Relating to the kerchunk library / specification itself), references generation (Reading byte ranges from archival files), and internals labels on Oct 17, 2024
TomNicholas (Member, Author) commented Oct 18, 2024

Note to self: I should just use an ABC with an open_virtual_dataset method defined at this point. That would standardize the interface, help shut mypy up, and bring us closer to what we would need for an entrypoint system.

EDIT: done

Comment on lines 2 to 12
```python
from pathlib import Path
from typing import Any, MutableMapping, Optional, cast
from typing import Iterable, Mapping, Optional

import ujson
from xarray import Dataset
from xarray.core.indexes import Index
from xarray.core.variable import Variable

from virtualizarr.backend import FileType, separate_coords
from virtualizarr.manifests import ChunkManifest, ManifestArray
from virtualizarr.readers.common import VirtualBackend
from virtualizarr.translators.kerchunk import dataset_from_kerchunk_refs
from virtualizarr.types.kerchunk import (
    KerchunkArrRefs,
    KerchunkStoreRefs,
)
from virtualizarr.utils import _FsspecFSFromFilepath
```
TomNicholas (Member, Author) commented Oct 18, 2024

Notice the kerchunk reader never imports kerchunk @mpiannucci

Comment on lines +172 to +174
```python
class VirtualBackend(ABC):
    @staticmethod
    def open_virtual_dataset(
```
TomNicholas (Member, Author)

I'm not sure that a @staticmethod is the best way to do this but for now it's just an internal implementation detail. The important thing is that the open_virtual_dataset function signature gets standardized by this approach.
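
As a rough sketch, the standardized interface might look something like this (the parameter list is an assumption for illustration; the real signature takes more arguments):

```python
# Rough sketch of the ABC approach; the parameters shown are illustrative,
# not the exact signature from virtualizarr/readers/common.py.
from abc import ABC
from collections.abc import Iterable, Mapping
from typing import Optional

from xarray import Dataset


class VirtualBackend(ABC):
    @staticmethod
    def open_virtual_dataset(
        filepath: str,
        drop_variables: Optional[Iterable[str]] = None,
        reader_options: Optional[Mapping] = None,
    ) -> Dataset:
        # each concrete reader (HDF5, netCDF3, DMR++, ...) overrides this
        raise NotImplementedError()
```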

TomNicholas (Member, Author)

@keewis you might have opinions on this PR - I'm trying to move towards a system of virtual "backend readers" like xarray has backends. My proposal (see #245) is to eventually have one actual xarray backend that calls virtualizarr, and virtualizarr calls one of a number of registered virtual backends depending on the filetype.

TomNicholas (Member, Author)

Note: ensure warning is raised on decode_cf

virtualizarr/backend.py (outdated review thread, resolved)
keewis (Contributor) commented Oct 18, 2024

> you might have opinions on this PR

I'm still somewhat split on the idea to create a virtualizarr backend for xr.open_dataset: I think of xr.open_dataset as a way to open a dataset (i.e. metadata and data), while with virtualizarr we're not actually opening the dataset, we're just reading the metadata and the chunk locations. I'm aware that this distinction is somewhat blurred if you can load the actual data using the virtual dataset, but still I feel somewhat uncomfortable with that idea (no strong opposition, though).

However, I totally support the creation of a plugin system for open_virtual_dataset, as that means that the code for other packages does not have to live in the base package.

For an actual code review I'd have to read the changes you're proposing here, which I will try to do at some point during the weekend (but the PR is huge, so it might take me some time).

TomNicholas (Member, Author)

> I'm still somewhat split on the idea to create a virtualizarr backend for xr.open_dataset: I think of xr.open_dataset as a way to open a dataset (i.e. metadata and data), while with virtualizarr we're not actually opening the dataset, we're just reading the metadata and the chunk locations. I'm aware that this distinction is somewhat blurred if you can load the actual data using the virtual dataset, but still I feel somewhat uncomfortable with that idea (no strong opposition, though).

I agree actually. I'm still not sure that it's a good idea. Maybe open_virtual_dataset is sufficient.

> However, I totally support the creation of a plugin system for open_virtual_dataset, as that means that the code for other packages does not have to live in the base package.

👍

> For an actual code review I'd have to read the changes you're proposing here, which I will try to do at some point during the weekend (but the PR is huge, so it might take me some time).

No worries! I'm pretty sure this is fine, it can also always be refactored further.

keewis (Contributor) commented Oct 18, 2024

(the failing tests are caused by a bad merge, as the changes to variable_from_kerchunk_refs from #260 disappeared)

TomNicholas (Member, Author)

> the failing tests are caused by a bad merge

Thanks for that heads up!

> Note: ensure warning is raised on decode_cf

Actually the warning should be on cf_variables. decode_times should be passed through, which it now is. But I'm confused why that didn't cause failures...

Comment on lines +25 to +36
```python
# TODO add entrypoint to allow external libraries to add to this mapping
VIRTUAL_BACKENDS = {
    "kerchunk": KerchunkVirtualBackend,
    "zarr_v3": ZarrV3VirtualBackend,
    "dmrpp": DMRPPVirtualBackend,
    # all the below call one of the kerchunk backends internally
    # (https://fsspec.github.io/kerchunk/reference.html#file-format-backends)
    "netcdf3": NetCDF3VirtualBackend,
    "hdf5": HDF5VirtualBackend,
    "netcdf4": HDF5VirtualBackend,  # note this is the same as for hdf5
    "tiff": TIFFVirtualBackend,
    "fits": FITSVirtualBackend,
}
```
TomNicholas (Member, Author)

@maxrjones I think this is what you were suggesting.
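
One possible shape for that TODO, sketched under the assumption of a hypothetical entrypoint group name ("virtualizarr.backends" is not an agreed-upon name) and Python 3.10+:

```python
# Hypothetical sketch: let third-party packages register readers under an
# entrypoint group, then merge them into the built-in mapping at lookup time.
from importlib.metadata import entry_points


def discover_virtual_backends(builtin: dict) -> dict:
    backends = dict(builtin)
    # the `group` keyword requires Python 3.10+
    for ep in entry_points(group="virtualizarr.backends"):
        backends[ep.name] = ep.load()  # expected to be a VirtualBackend subclass
    return backends
```

A third-party package would then expose its reader under that group in its pyproject.toml, much like xarray discovers backend engines via entrypoints.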

TomNicholas (Member, Author)

I'm going to merge this now because I know @sharkinsspatial and @ayushnag would like to work off of it, but @keewis feel free to add any comments / thoughts here and I will address them in follow-up PRs.

TomNicholas merged commit 29ca4ac into main on Oct 19, 2024; 10 checks passed.
TomNicholas deleted the split_kerchunk_reader branch on October 19, 2024 at 00:42.