Replace this package with a VirtualiZarr reader? #337

Open
TomNicholas opened this issue Aug 9, 2024 · 5 comments

@TomNicholas

TomNicholas commented Aug 9, 2024

I don't know anything really about the format of MITgcm output files other than that they are some bespoke binary format, but I can't help wondering if it would actually be easier to create a cloud-optimized version of MITgcm data by writing a reader for virtualizarr (i.e. a kerchunk-style reader) rather than actually converting the binary data to zarr.

The advantages would be that:

  • if you want to make the data available to xarray users, even in the cloud, you don't have to alter or duplicate the original data (for cloud access you could just upload the original output files to a bucket with no alterations),
  • the reader would work for any MITgcm output (so effectively replacing most of xMITgcm),
  • creating the over-arching virtual zarr store becomes the same problem that everyone else has (the one the rest of the virtualizarr package is meant to solve).

It would involve essentially rewriting this function

`def read_mds(fname, iternum=None, use_mmap=None, endian='>', shape=None, ...)`

to look like one of the kerchunk readers, or ideally more like zarr-developers/VirtualiZarr#113.
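
To make that concrete, here's a rough sketch of what such a reader might look like, assuming the `ChunkManifest`/`ManifestArray` interfaces discussed in that issue (names and signatures may well have drifted since, and `parse_meta` below is a deliberately simplified, hypothetical stand-in for xmitgcm's real `.meta` parsing, which also handles record counts, tiling, and Fortran dimension order):

```python
import re

import numpy as np
from virtualizarr.manifests import ChunkManifest, ManifestArray
from virtualizarr.zarr import ZArray


def parse_meta(meta_path):
    """Pull shape and dtype out of an MDS .meta file (heavily simplified)."""
    text = open(meta_path).read()
    # dimList holds (global size, first index, last index) triplets per dim
    triplets = re.search(r"dimList\s*=\s*\[(.*?)\]", text, re.S).group(1)
    dims = [int(s) for s in triplets.replace(",", " ").split()]
    shape = tuple(dims[::3])
    prec = re.search(r"dataprec\s*=\s*\[\s*'(\w+)'", text).group(1)
    dtype = np.dtype(">f4") if prec == "float32" else np.dtype(">f8")
    return shape, dtype


def virtual_mds_array(data_path, meta_path):
    """Build a ManifestArray that points at the raw .data file byte-for-byte."""
    shape, dtype = parse_meta(meta_path)
    nbytes = int(np.prod(shape)) * dtype.itemsize
    # A single chunk spanning the whole uncompressed record
    key = ".".join("0" for _ in shape)
    manifest = ChunkManifest(
        entries={key: {"path": data_path, "offset": 0, "length": nbytes}}
    )
    zarray = ZArray(
        shape=shape, chunks=shape, dtype=dtype,
        compressor=None, filters=None, fill_value=None, order="C",
    )
    return ManifestArray(zarray=zarray, chunkmanifest=manifest)
```

The key point is that no bytes are rewritten: the manifest just records where each chunk already lives inside the original `.data` file.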

Because it seems MITgcm output already separates metadata from data to some degree, this could potentially work really nicely...

See also zarr-developers/VirtualiZarr#218

One downside of that approach, though, would be the inability to alter the chunking: the virtual chunks have to match how the bytes are already laid out in the original files.

cc @cspencerjones

@TomNicholas
Author

Turns out there is already an issue discussing something very similar (which didn't appear when I searched "kerchunk") - see #28 (comment).

@cspencerjones
Contributor

I've been thinking about this, and I'm not 100% sure that it's a good idea in the end. The main issue is that most MITgcm output is not compressed at all, so direct upload to the cloud may not be something we want to encourage, especially for realistic-geometry simulations, which contain a lot of land (compression usually does not reduce the size of ocean output very much, but constant land points compress away to almost nothing). The upside of the format is that flexible chunking should be possible in theory.

The LLC2160 & LLC4320 data are in a bespoke "shrunk" (still binary) format, where the land points have been removed, so further compression would have very limited benefit. But reading it would require writing code that's very specific to this dataset, and I do not believe that further datasets will be generated in this bespoke format. Part of the data-access problem with this data has nothing to do with the format and is simply caused by the limited bandwidth out of Pleiades. Still, given the choice between a general MITgcm reader and a more specific reader for LLC2160/4320, I think the more specific reader would be most useful, because this data is still by far the heaviest lift most people are doing, and many people cannot use the data because of how difficult access still is. (This is all just my opinion and I am prepared to hear other arguments.)

@rabernat
Member

I actually started something like this three years ago! https://github.com/rabernat/mds2zarr - of course VirtualiZarr is a much better and more robust approach.

I agree with @cspencerjones that the funky compression of the LLC data is potentially a blocker. If we can make this Zarr-compatible, it should be possible.

However, that is really an edge case: most "normal" MDS data output from MITgcm should be perfectly fine as uncompressed flat binary.
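
As a rough illustration of why the flat-binary case is so easy (the bucket, file name, variable name, and byte counts here are all made up): a kerchunk-style reference set for one such variable is just a path, byte offset, and byte length per chunk, with no codec at all.

```python
# Hypothetical kerchunk-style references for one uncompressed MDS variable.
# Each chunk entry is [path, byte offset, byte length]; "compressor": null
# means the original bytes are served as-is, with no re-encoding anywhere.
refs = {
    "version": 1,
    "refs": {
        ".zgroup": '{"zarr_format": 2}',
        "T/.zarray": (
            '{"shape": [40, 90, 90], "chunks": [40, 90, 90], "dtype": ">f4", '
            '"compressor": null, "filters": null, "fill_value": null, '
            '"order": "C", "zarr_format": 2}'
        ),
        # 40 * 90 * 90 values * 4 bytes each = 1,296,000 bytes
        "T/0.0.0": ["s3://some-bucket/T.0000000001.data", 0, 1296000],
    },
}
```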

@TomNicholas
Author

> the funky compression of the LLC data is potentially a blocker. If we can make this Zarr-compatible, it should be possible.

This seems like an analogous problem to zarr-developers/zarr-specs#303 - i.e. it could be solved by defining a special zarr codec that is specific to this data format.
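
For illustration, such a codec might look something like the sketch below, using the numcodecs `Codec` interface. Everything here is hypothetical: the `codec_id`, the idea of loading the wet-point mask from a side file, and the fill value. Note that `decode` depends on an external mask that isn't part of the compressed buffer itself:

```python
import numpy as np
from numcodecs.abc import Codec
from numcodecs.registry import register_codec


class ShrunkLLCCodec(Codec):
    """Hypothetical codec re-inflating 'shrunk' LLC records (wet points only)
    back onto the full grid."""

    codec_id = "mitgcm_shrunk_llc"  # made-up identifier

    def __init__(self, mask_path, fill_value=0.0):
        self.mask_path = mask_path    # side file holding the boolean wet mask
        self.fill_value = fill_value

    def decode(self, buf, out=None):
        wet = np.frombuffer(buf, dtype=">f4")   # ocean points only
        mask = np.load(self.mask_path)          # external dataset, note!
        full = np.full(mask.shape, self.fill_value, dtype="f4")
        full[mask] = wet                        # scatter back onto the grid
        if out is not None:
            np.copyto(out, full)
            return out
        return full

    def encode(self, buf):
        mask = np.load(self.mask_path)
        # keep only the wet points, big-endian to match decode
        return np.asarray(buf, dtype="f4")[mask].astype(">f4").tobytes()


register_codec(ShrunkLLCCodec)
```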

@rabernat
Member

Except it's really complicated because the "codec" for decoding each array relies on an external dataset (the null mask) which doesn't even have the same shape as the data. This breaks many of the abstractions implicit in the "codec" interface.
