Zarr : Allow setting write_empty_chunks
#8016
- Incorrectly set default values
- Need to set write_empty_chunks for new variables in group
Thank you for opening this pull request! It may take us a few days to respond here, so thank you for being patient.
Co-authored-by: Deepak Cherian <[email protected]>
for more information, see https://pre-commit.ci
Looks like this will require |
If I understand the version policy CI correctly, I think you can bump zarr to 2.12.
- Default value of None which will fall back to encoding specified behavior or Zarr defaults
- If param (!= None) and encoding setting disagree, a ValueError is raised
- Test case checks for compatible zarr version
- Documented minimum required Zarr version
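The precedence rules listed in the commit message above can be sketched in a few lines. This is a minimal illustration, not the code merged in the PR: the helper name `resolve_write_empty_chunks` and its signature are hypothetical, and only the behavior stated in the commit message (fall back on `None`, raise `ValueError` on disagreement) is taken from the source.

```python
from typing import Optional


def resolve_write_empty_chunks(param: Optional[bool], encoding: dict) -> Optional[bool]:
    """Hypothetical helper combining the keyword argument with a variable's encoding.

    Returns None when neither source sets a value, which stands for
    "let Zarr use its own default".
    """
    encoded = encoding.get("write_empty_chunks")
    if param is None:
        # No explicit argument: defer to the encoding, if it says anything.
        return encoded
    if encoded is not None and encoded != param:
        # Both sources are set and they disagree.
        raise ValueError(
            f"write_empty_chunks={param!r} disagrees with encoding value {encoded!r}"
        )
    return param
```

Returning `None` rather than a concrete boolean keeps the choice of default with Zarr, which matches the "fall back to ... Zarr defaults" wording.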
Could you change the minimum requirement of zarr to 2.12 as well? It's here: xarray/ci/requirements/min-all-deps.yml Line 53 in bb501ba
If the Minimum Version Policy (min-all-deps) CI passes, you can simplify the test and add that change to the what's new. Here's a template you can use for that: Lines 537 to 558 in bb501ba
Zarr minimum dependency bump should make the version check no longer necessary
@Illviljan Done. Bumped zarr version + documented it in |
store,
mode="w",
encoding=encoding,
write_empty_chunks=write_empty,
Does it make sense to allow this to be controlled on a per-variable basis? If so, `encoding` would be the right place to specify it. I don't have an opinion, just a question.
It's an option for Zarr array creation so it would originally be specified on a per-variable basis; however, the issue is that there is no way to specify it when appending data to an existing store, as there is an error raised when `encoding` is not empty.
What this reveals is that our usage of `encoding` is overloaded. There are evidently two distinct types of information that we can put in `encoding`:
- Specifications for how to store the data on disk, e.g. dtype, compression, etc, that must be consistent for all subsequent writes to the Zarr array
- Runtime choices, like `write_empty_chunks`, that can be different for each write (and that don't necessarily need to be known to read the data back)
Perhaps not necessary for this PR, but perhaps we could consider explicitly distinguishing these two distinct types of encoding. Maybe one solution is to eventually remove `write_empty_chunks` and all such runtime choices from `encoding` and have them only be supported as kwargs.
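One way to picture the distinction drawn above is a small function that separates the two kinds of entries. This is only a sketch: `RUNTIME_KEYS` and `split_encoding` are hypothetical names, and treating `write_empty_chunks` as the sole runtime key is an assumption made for illustration, not xarray's actual behavior.

```python
from typing import Any, Dict, Tuple

# Assumption: write_empty_chunks is the only runtime choice considered here.
RUNTIME_KEYS = frozenset({"write_empty_chunks"})


def split_encoding(encoding: Dict[str, Any]) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Split an encoding dict into (on-disk specification, runtime choices)."""
    storage = {key: value for key, value in encoding.items() if key not in RUNTIME_KEYS}
    runtime = {key: value for key, value in encoding.items() if key in RUNTIME_KEYS}
    return storage, runtime
```

With such a split, only the first dict would need to stay consistent across writes, while the second could differ on every append.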
@rabernat That sounds like it would be a good idea. I'm sure there are probably more cases where this lack of distinction could cause a problem.
xarray/backends/zarr.py
Outdated
@@ -666,6 +671,8 @@ def set_variables(self, variables, check_encoding_set, writer, unlimited_dims=No
# metadata. This would need some case work properly with region
# and append_dim.
zarr_array = self.zarr_group[name]
if self._write_empty is not None:
    zarr_array._write_empty_chunks = self._write_empty
Looks like a non-public attribute? There's no public alternative?
Not that I could find. There's a way to specify it when opening direct arrays but since xarray goes through `zarr.hierarchy.open_group` (and there is no equivalent parameter there compared to `zarr.creation.open_array`, nor kwargs to pass) there is no way to do so, at least from within xarray.
This makes me nervous. How important is it that we support this feature on existing arrays? Alternatively, we can work on making the `write_empty_chunks` property in Zarr public.
Another option is to change how you create the array. Instead of using `zarr_array = self.zarr_group[name]`, you can create the array directly, e.g.
zarr_array = zarr.open(store=self.zarr_group.store, path=f'{self.zarr_group.name}/{name}', write_empty_chunks=...)
This is longer, but you have complete control over the instantiation of the zarr array.
I'll give that a try, may be a good solution pending zarr-developers/zarr-python#1478
Co-authored-by: Illviljan <[email protected]>
A few comments that hopefully help move this forward. Can you open an issue with Zarr to make `Array.write_empty_chunks` a public setter?
Co-authored-by: Joe Hamman <[email protected]>
In response to #8016 (comment)
The issue is that the initial preference for an array is not kept when appending data to that array. When the data is very sparse, this can lead to many unnecessary empty chunks being written, which can have all sorts of implications (#8009). Zarr appears to set this behavior when opening/creating arrays, but not when opening groups (which is how it's handled in the xarray backend), and it is defaulted to |
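To make the cost for sparse data concrete, the following toy model counts which chunks of a one-dimensional array would be stored under each setting. It is a sketch only and does not use or reproduce Zarr's code: the function `chunks_to_write`, its parameters, and the choice of `0.0` as fill value are assumptions for illustration.

```python
from typing import List, Sequence


def chunks_to_write(
    data: Sequence[float],
    chunk_size: int,
    fill_value: float = 0.0,
    write_empty_chunks: bool = True,
) -> List[int]:
    """Return the indices of chunks that would be stored (toy model)."""
    written = []
    for index, start in enumerate(range(0, len(data), chunk_size)):
        chunk = data[start:start + chunk_size]
        # A chunk counts as "empty" when every element equals the fill value.
        is_empty = all(value == fill_value for value in chunk)
        if write_empty_chunks or not is_empty:
            written.append(index)
    return written


# 16 values in chunks of 4; only the third chunk holds a non-fill value.
sparse = [0.0] * 8 + [1.5, 0.0] + [0.0] * 6
all_chunks = chunks_to_write(sparse, 4, write_empty_chunks=True)   # [0, 1, 2, 3]
non_empty = chunks_to_write(sparse, 4, write_empty_chunks=False)   # [2]
```

In this toy case three of the four chunks carry no information, so skipping them on every append avoids three out of four writes.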
Thanks to @d-v-b's suggestion, we're no longer writing to a non-public attribute of the |
Thanks @RKuttruff. @jhamman, can we merge?
Thanks @RKuttruff! Welcome to Xarray!
Thank you so much @dcherian!
- Installs branch from pydata/xarray#8016
- Replaces `{var: {'write_empty_chunks': False}}` with `write_empty_chunks=False` kwarg in `Dataset.to_zarr` calls in `ZarrWriter`
`Dataset.to_zarr` cannot specify to not write empty chunks when appending to existing store #8009
Zarr has an attribute in `zarr.core.Array` that specifies the behavior desired in this issue. Added a parameter in `Dataset.to_zarr`, `write_empty_chunks`, to allow the user to explicitly set this behavior.
Test case to ensure empty chunks are/are not written on append given the value of `write_empty_chunks`.
whats-new.rst