Identifying coordinate variables #105

Closed
TomNicholas opened this issue May 9, 2024 · 5 comments · Fixed by #156

Comments

@TomNicholas
Member

TomNicholas commented May 9, 2024

I'm a little unclear on how xarray / netCDF coordinate variables are distinguished from data variables in the Zarr model.

Xarray follows the netCDF convention that 1-dimensional variables with the same name as their only dimension are to be treated as "coordinates", but it is also possible to have additional 1D or multi-dimensional coordinates (which netCDF calls "auxiliary coordinate variables").
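The dimension-coordinate rule described above can be sketched in a few lines of plain Python. This is only an illustration of the naming convention, not xarray's actual implementation; the function name `dimension_coordinate_names` and the example variable layout are hypothetical.

```python
def dimension_coordinate_names(var_dims):
    """Return the names of variables that are 1D and named after their only dimension.

    `var_dims` maps each variable name to a tuple of its dimension names
    (an assumed, simplified stand-in for a real dataset).
    """
    return sorted(
        name
        for name, dims in var_dims.items()
        if len(dims) == 1 and dims[0] == name
    )


# Hypothetical layout: variable name -> dimension names.
var_dims = {
    "time": ("time",),            # dimension coordinate
    "x": ("x",),                  # dimension coordinate
    "station_id": ("x",),         # 1D, but not named after its dimension
    "lat": ("x", "y"),            # multi-dimensional (auxiliary) coordinate candidate
    "air": ("time", "x", "y"),    # data variable
}

print(dimension_coordinate_names(var_dims))
```

Note that `station_id` and `lat` are not picked up by this rule alone; identifying them as coordinates needs extra metadata such as a `coordinates` attribute.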

I think at the very least we might need an additional line of code inside open_virtual_dataset that looks for any metadata attribute named 'coordinates'.

@TomNicholas
Member Author

Turns out the way coordinates are specified in the kerchunk references format is as follows:

  • If there is one non-dimension coordinate, its name is saved as a string in the top-level .zattrs, i.e. '.zattrs': '{"coordinates":"lat"}'.
  • If there is more than one non-dimension coordinate, their names are saved not as a list of strings but as a single space-separated string, i.e. '.zattrs': '{"coordinates":"lat lon"}'. Note that this is a different syntax from how multiple array dimensions are recorded: '{"_ARRAY_DIMENSIONS":["x","y"]}'.
  • Dimension coordinates are not specified at all, so 1D variables with the same name as their only dimension will be interpreted as coordinates even when they don't have an entry in the .zattrs.

(None of this is actually described in the kerchunk references specification; I had to work this out by trial and error.)
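As a rough sketch of the three rules above, the following reads coordinate names out of a kerchunk-style references mapping. It assumes the mapping's keys follow the Zarr v2 layout (a top-level `.zattrs` plus one `<variable>/.zattrs` per array) with JSON strings as values; the function name `split_coords_and_data_vars` and the example references are hypothetical, and this is not the code used in virtualizarr.

```python
import json


def split_coords_and_data_vars(refs):
    """Split variable names in a kerchunk-style refs mapping into coords and data vars."""
    root_attrs = json.loads(refs.get(".zattrs", "{}"))

    # Non-dimension coordinates: a single space-separated string, not a list.
    coords = set(root_attrs.get("coordinates", "").split())

    suffix = "/.zattrs"
    var_names = set()
    for key, value in refs.items():
        if not key.endswith(suffix):
            continue
        name = key[: -len(suffix)]
        var_names.add(name)
        # Dimension coordinates: 1D variables named after their only dimension.
        dims = json.loads(value).get("_ARRAY_DIMENSIONS", [])
        if dims == [name]:
            coords.add(name)

    return sorted(coords), sorted(var_names - coords)


# Hypothetical references, shaped like the examples quoted above.
refs = {
    ".zattrs": '{"coordinates":"lat lon"}',
    "x/.zattrs": '{"_ARRAY_DIMENSIONS":["x"]}',
    "lat/.zattrs": '{"_ARRAY_DIMENSIONS":["x","y"]}',
    "lon/.zattrs": '{"_ARRAY_DIMENSIONS":["x","y"]}',
    "air/.zattrs": '{"_ARRAY_DIMENSIONS":["x","y"]}',
}

coords, data_vars = split_coords_and_data_vars(refs)
print(coords, data_vars)
```

Calling `str.split()` with no arguments handles both the single-name and the space-separated cases, and returns an empty list when the attribute is absent.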

@dcherian

dcherian commented Jun 24, 2024

These are CF conventions and will/should be handled by Xarray. There's an open issue about implementing Dataset.encode_cf that might be useful here.

And for your first question, Zarr doesn't distinguish between coordinate and data variables.

@TomNicholas
Member Author

TomNicholas commented Jun 24, 2024

> These are CF conventions

Oh okay, I didn't realise that. I'm struggling to find a self-contained description of this in the CF conventions doc but there is at least one example just above here.

> will/should be handled by Xarray.

Currently, because we build the virtual dataset variable-by-variable and already have to do some sneaky stuff with the coordinates to avoid accidentally creating indexes (inside separate_coords), I think this has to be done manually within virtualizarr. I've implemented that in #156. It would be nice to delegate this to an xarray function though!

> There's an open issue about implementing Dataset.encode_cf that might be useful here

Found it: pydata/xarray#4412

> And for your first question, Zarr doesn't distinguish between coordinate and data variables.

So IIUC, in a Zarr store, coordinates would be specified to xarray in the same way, through this CF-convention format?

@dcherian

https://cfconventions.org/Data/cf-conventions/cf-conventions-1.11/cf-conventions.html#coordinate-system

The encode/decode machinery should not be creating indexes, iiuc.

@TomNicholas
Member Author

TomNicholas commented Jun 24, 2024

> It would be nice to delegate this to an xarray function though!

Okay, I've raised #157 to track the xr.encode_cf/decode_cf idea. This issue will be closed by #156.
