Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Zarr : Allow setting write_empty_chunks #8016

Merged
merged 21 commits into from
Aug 4, 2023

Conversation

RKuttruff
Copy link
Contributor

Zarr has an attribute in zarr.core.Array that specifies the behavior desired in this issue. Added a parameter in Dataset.to_zarr, write_empty_chunks to allow the user to explicitly set this behavior.

  • Tests added

Test case to ensure empty chunks are/are not written on append given the value of write_empty_chunks.

  • User visible changes (including notable bug fixes) are documented in whats-new.rst

rileykk added 2 commits July 24, 2023 12:39
- Incorrectly set default values
- Need to set write_empty_chunks for new variables in group
@welcome
Copy link

welcome bot commented Jul 24, 2023

Thank you for opening this pull request! It may take us a few days to respond here, so thank you for being patient.
If you have questions, some answers may be found in our contributing guidelines.

@github-actions github-actions bot added topic-backends topic-zarr Related to zarr storage library io labels Jul 24, 2023
@dcherian dcherian changed the title Write empty chunks Zarr : Allow setting write_empty_chunks Jul 24, 2023
@RKuttruff
Copy link
Contributor Author

Looks like this will require zarr>=2.11 to have an effect; otherwise it will issue a warning of unrecognized kwarg on to_zarr. This seems to be what is causing the min_all_deps check to fail (+ the fact that the new test cases expect it to work when it shouldn't with that version of zarr). Is this something that would be better resolved by a check or bumping the minimum required zarr version?

@Illviljan
Copy link
Contributor

If I understand the version policy CI correctly, I think you can bump zarr to 2.12.

xarray/backends/zarr.py Outdated Show resolved Hide resolved
- Default value of None which will fall back to encoding specified behavior or Zarr defaults
- If param (!= None) and encoding setting disagree, a ValueError is raised
- Test case checks for compatible zarr version
- Documented minimum required Zarr version
@Illviljan
Copy link
Contributor

Could you change the minimum requirement of zarr to 2.12 as well? It's here:

If the Minimum Version Policy (min-all-deps) CI passes, you can simplify the test and add that change to the what's new. Here's a template you can use for that:

xarray/doc/whats-new.rst

Lines 537 to 558 in bb501ba

Breaking changes
~~~~~~~~~~~~~~~~
- The minimum versions of some dependencies were changed (:pull:`7300`):
========================== ========= ========
Package Old New
========================== ========= ========
boto 1.18 1.20
cartopy 0.19 0.20
distributed 2021.09 2021.11
dask 2021.09 2021.11
h5py 3.1 3.6
hdf5 1.10 1.12
matplotlib-base 3.4 3.5
nc-time-axis 1.3 1.4
netcdf4 1.5.3 1.5.7
packaging 20.3 21.3
pint 0.17 0.18
pseudonetcdf 3.1 3.2
typing_extensions 3.10 4.0
========================== ========= ========

rileykk added 2 commits July 25, 2023 14:59
Zarr minimum dependency bump should make the version check no longer necessary
@github-actions github-actions bot added CI Continuous Integration tools dependencies Pull requests that update a dependency file labels Jul 25, 2023
@RKuttruff
Copy link
Contributor Author

@Illviljan Done. Bumped zarr version + documented it in whats-new.rst

store,
mode="w",
encoding=encoding,
write_empty_chunks=write_empty,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it make sense to allow this to be controlled on a per-variable basis? If so, encoding would be the right place to specify it. I don't have an opinion, just a question

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's an option for Zarr array creation so it would originally be specified on a per-variable basis; however, the issue is that there is no way to specify it when appending data to an existing store, as there is an error raised when encoding is not empty.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What this reveals is that our usage of encoding is overloaded. There are evidently two distinct types of information that we can put in encoding:

  • Specifications for how to store the data on disk, e.g. dtype, compression, etc, that must be consistent for all subsequent writes to the Zarr array
  • Runtime choices, like write_empty_chunks, that can be different for each write (and that don't necessarily need to be known to read the data back)

Perhaps not necessary for this PR, but perhaps we could consider explicitly distinguishing these two distinct types of encoding. Maybe one solution is to eventually remove write_empty_chunks and all such runtime choices from encoding and have them only be supported as kwargs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rabernat That sounds like it would be a good idea. I'm sure there are probably more cases where this lack of distinction could cause a problem.

@@ -666,6 +671,8 @@ def set_variables(self, variables, check_encoding_set, writer, unlimited_dims=No
# metadata. This would need some case work properly with region
# and append_dim.
zarr_array = self.zarr_group[name]
if self._write_empty is not None:
zarr_array._write_empty_chunks = self._write_empty
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like a non-public attribute? There's no public alternative?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not that I could find. There's a way to specify it when opening direct arrays but since xarray goes through zarr.hierarchy.open_group (and there is no equivalent parameter there compared to zarr.creation.open_array, nor kwargs to pass) there is no way to do so, at least from within xarray.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes me nervous. How important is it that we support this feature on existing arrays? Alternatively, we can work on making the write_empty_chunks property in Zarr public.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another option is to change how you create the array. Instead of using zarr_array = self.zarr_group[name], you can create the array directly, e.g.

zarr_array = zarr.open(store=self.zarr_group.store, path=f'{self.zarr_group.name}/{name}', write_empty_chunks=...)

This is longer, but you have complete control over the instantiation of the zarr array

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll give that a try, may be a good solution pending zarr-developers/zarr-python#1478

xarray/backends/zarr.py Outdated Show resolved Hide resolved
xarray/backends/zarr.py Outdated Show resolved Hide resolved
xarray/tests/test_backends.py Outdated Show resolved Hide resolved
xarray/tests/test_backends.py Outdated Show resolved Hide resolved
xarray/backends/zarr.py Show resolved Hide resolved
@RKuttruff RKuttruff requested a review from Illviljan July 25, 2023 23:38
Copy link
Member

@jhamman jhamman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few comments that hopefully help move this forward. Can you open an issue with Zarr to make Array.write_empty_chunks a public setter?

ci/requirements/min-all-deps.yml Outdated Show resolved Hide resolved
doc/whats-new.rst Outdated Show resolved Hide resolved
@@ -666,6 +671,8 @@ def set_variables(self, variables, check_encoding_set, writer, unlimited_dims=No
# metadata. This would need some case work properly with region
# and append_dim.
zarr_array = self.zarr_group[name]
if self._write_empty is not None:
zarr_array._write_empty_chunks = self._write_empty
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes me nervous. How important is it that we support this feature on existing arrays? Alternatively, we can work on making the write_empty_chunks property in Zarr public.

xarray/core/dataset.py Outdated Show resolved Hide resolved
@github-actions github-actions bot removed CI Continuous Integration tools dependencies Pull requests that update a dependency file labels Jul 27, 2023
@RKuttruff
Copy link
Contributor Author

@jhamman zarr-developers/zarr-python#1478

@RKuttruff
Copy link
Contributor Author

RKuttruff commented Jul 27, 2023

In response to #8016 (comment)

How important is it that we support this feature on existing arrays?

The issue is that the initial preference for an array is not kept when appending data to that array. In cases where the data is very sparse, this can lead to many unnecessary empty chunks being written which can have all sorts of implications. (#8009). Zarr appears to set this behavior when opening/creating arrays, but not when opening groups (which is how it's handled in the xarray backend), and it is defaulted to True. This change would give the user the choice as they would have if there were manually writing/appending to the arrays, rather than going through to_zarr.

@RKuttruff
Copy link
Contributor Author

Thanks to @d-v-b's suggestion, we're no longer writing to a non-public attribute of the Array object

@RKuttruff RKuttruff requested review from d-v-b and jhamman July 27, 2023 20:04
@dcherian
Copy link
Contributor

dcherian commented Aug 3, 2023

Thanks @RKuttruff @jhamman can we merge?

@dcherian dcherian merged commit 0343761 into pydata:main Aug 4, 2023
27 checks passed
@welcome
Copy link

welcome bot commented Aug 4, 2023

Congratulations on completing your first pull request! Welcome to Xarray! We are proud of you, and hope to see you again! celebration gif

@dcherian
Copy link
Contributor

dcherian commented Aug 4, 2023

Thanks @RKuttruff ! Welcome to Xarray!

@RKuttruff
Copy link
Contributor Author

Thank you so much @dcherian!

RKuttruff pushed a commit to EarthDigitalTwin/OCO3-data-transformer that referenced this pull request Nov 7, 2023
- Installs branch from pydata/xarray#8016
- Replaces `{var: {'write_empty_chunks': False}}` with `write_empty_chunks=False` kwarg in `Dataset.to_zarr` calls in `ZarrWriter`
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
io topic-backends topic-zarr Related to zarr storage library
Projects
None yet
Development

Successfully merging this pull request may close these issues.

In Dataset.to_zarr cannot specify to not write empty chunks when appending to existing store
6 participants