Feat/write empty chunks #2429

Open
wants to merge 5 commits into
base: main
from

d-v-b
Contributor

@d-v-b d-v-b commented Oct 22, 2024

This PR adds a boolean array.write_empty_chunks value to the global config, and uses this value to control whether chunks that are "empty", i.e. filled with values equivalent to the array's fill value, are written to storage.

In zarr-python 2.x, write_empty_chunks was a property of an Array that users specified when creating the Array object. This had pros and cons which I'm happy to discuss if people are interested, but the tl;dr is that the cons of that approach are driving my decision in this PR to make write_empty_chunks a global runtime property accessible via the config API.

Usage looks something like this (donfig experts please correct me if there's a better way):

with config.set({"array.write_empty_chunks": write_empty_chunks}):
    arr[:] = fill_value

If people hate this, then we can definitely change this API. I'm very open to discussion here.
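To make the intended semantics concrete, here is a minimal, self-contained sketch of a write path that consults a scoped setting at write time. It is not zarr's implementation: `ToyConfig`, `write_chunk`, and the dict-backed `store` are hypothetical stand-ins, chunks are plain lists, and removing an existing entry for an empty chunk is a choice made for this toy only.

```python
from contextlib import contextmanager


class ToyConfig:
    """Hypothetical stand-in for a global config object (not zarr's config)."""

    def __init__(self, defaults):
        self._values = dict(defaults)

    def get(self, key):
        return self._values[key]

    @contextmanager
    def set(self, overrides):
        # Apply the overrides for the duration of the `with` block, then restore.
        previous = dict(self._values)
        self._values.update(overrides)
        try:
            yield
        finally:
            self._values = previous


config = ToyConfig({"array.write_empty_chunks": False})


def write_chunk(store, key, chunk, fill_value):
    """Store `chunk` under `key`, skipping all-fill chunks unless configured otherwise."""
    is_empty = all(value == fill_value for value in chunk)
    if is_empty and not config.get("array.write_empty_chunks"):
        # In this toy, an empty chunk is represented by the absence of its key.
        store.pop(key, None)
        return
    store[key] = list(chunk)


store = {}
write_chunk(store, "c/0", [0, 0, 0], fill_value=0)      # empty, skipped by default
with config.set({"array.write_empty_chunks": True}):
    write_chunk(store, "c/1", [0, 0, 0], fill_value=0)  # empty, written inside the override
write_chunk(store, "c/2", [0, 5, 0], fill_value=0)      # non-empty, always written
print(sorted(store))  # ['c/1', 'c/2']
```

The override applies only inside the `with` block, so two writes to the same array can use different empty-chunk handling without recreating the array.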

Also worth noting:

Our check for whether a chunk is equal to the fill value is pretty inefficient -- it's allocating a new array for every check invocation. This can definitely be made more efficient, in a stupid way by caching an all-fill-value chunk on the array instance and using that for the comparison, or a smarter way by doing the (chunk, fill_value) comparison without allocating a new array. But I think this is an effort for a separate PR.
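One possible shape for the "smarter" comparison is sketched below, assuming NumPy chunks and a scalar fill value. The helper name `chunk_equals_fill` is hypothetical and this is not the code in this PR. It avoids building a fill-valued array of chunk size, although NumPy still allocates a temporary boolean array for the comparison result.

```python
import numpy as np


def chunk_equals_fill(chunk, fill_value):
    """Return True if every element of `chunk` equals the scalar `fill_value`."""
    fill = np.asarray(fill_value, dtype=chunk.dtype)
    if chunk.dtype.kind in "fc" and np.isnan(fill):
        # NaN != NaN, so a NaN fill value needs an explicit isnan test.
        return bool(np.isnan(chunk).all())
    # Scalar broadcasting: the 0-d `fill` is compared against every element
    # without materializing a chunk-sized array of fill values.
    return bool((chunk == fill).all())


print(chunk_equals_fill(np.zeros((4, 4)), 0))             # True
print(chunk_equals_fill(np.arange(4.0), 0))               # False
print(chunk_equals_fill(np.full((2, 3), np.nan), np.nan)) # True
```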

closes #2409

TODO:

  • Add unit tests and/or doctests in docstrings
  • Add docstrings and API docs for any new/modified user-facing classes and functions
  • New/modified features documented in docs/tutorial.rst
  • Changes documented in docs/release.rst
  • GitHub Actions have all passed
  • Test coverage is 100% (Codecov passes)

Contributor

@normanrz normanrz left a comment

I think this is a good approach. Should we add some backwards compatibility thing for the write_empty_chunks kwarg in zarr.open?

@TomAugspurger
Contributor

TomAugspurger commented Oct 22, 2024

Also worth noting:

Our check for whether a chunk is equal to the fill value is pretty inefficient

I think that np.broadcast_arrays is smart in a couple of ways when it's an array-scalar operation.

In [41]: x = np.random.randn(10, 10, 10)

In [42]: x2, y = np.broadcast_arrays(x, 0)

In [43]: x is x2  # No copy of the array is created
Out[43]: True

In [44]: y.base  # Only a single value is allocated for the fill value array data.

It'd be nice to avoid the equality check when writing, at least under some circumstances, but I haven't thought of an easy way to do that.
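The two properties from the session above can also be checked as a script. This is a small sketch that uses `np.shares_memory` and `ndarray.strides` (not part of the session above) in place of the `is` and `.base` checks.

```python
import numpy as np

x = np.random.randn(10, 10, 10)
x2, y = np.broadcast_arrays(x, 0)

# The full-size operand is passed through rather than copied.
assert np.shares_memory(x, x2)

# The scalar operand has the chunk's shape but zero strides, so every
# index maps to the same single stored element.
assert y.shape == x.shape
assert y.strides == (0, 0, 0)

# Comparing against the broadcast view behaves like comparing against a scalar.
x[...] = 0
assert np.array_equal(x2, y)
```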

@d-v-b
Contributor Author

d-v-b commented Oct 22, 2024

Should we add some backwards compatibility thing for the write_empty_chunks kwarg in zarr.open?

Are you thinking of something like a warning to guide people to use the configuration approach, if they pass in write_empty_chunks?

@normanrz
Contributor

Yes, a warning and maybe even setting the config for them?

@d-v-b
Contributor Author

d-v-b commented Oct 22, 2024

Yes, a warning and maybe even setting the config for them?

I think a warning is a good idea but I'm hesitant to have any runtime code that sets config variables beyond the initial setup. IMO we are better off treating it as immutable, and leaving it to users to set. I think we can afford to just do a warning here because user code won't break if write_empty_chunks is set to the wrong value.
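A warn-only shim along these lines might look like the sketch below. `open_array` is a hypothetical stand-in rather than zarr's actual signature, and the choice of `FutureWarning` and the message wording are assumptions.

```python
import warnings


def open_array(path, write_empty_chunks=None):
    """Hypothetical stand-in for an `open`-style function (not zarr's signature)."""
    if write_empty_chunks is not None:
        warnings.warn(
            "The write_empty_chunks argument is ignored here; set "
            "'array.write_empty_chunks' in the global config instead.",
            FutureWarning,  # assumption: the warning category is not settled in this thread
            stacklevel=2,
        )
    # The config is deliberately left untouched; the argument is only reported.
    return {"path": path}


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    arr = open_array("data.zarr", write_empty_chunks=True)

print(len(caught), caught[0].category.__name__)  # 1 FutureWarning
```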

@normanrz
Contributor

That sounds reasonable

@d-v-b d-v-b marked this pull request as ready for review October 22, 2024 16:37
@dcherian
Contributor

dcherian commented Oct 22, 2024

It'd be nice to avoid the equality check when writing

I've forgotten the code path now, but if zarr creates the empty chunk using np.broadcast_to(self.fill_value, chunk_shape), then we might just check equality of .base or something like that.
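A rough sketch of that kind of shortcut follows. It swaps in a zero-stride test rather than inspecting `.base`, because the strides of a broadcast scalar are easy to verify. The helper names are hypothetical, NaN fill values are not handled, and whether zarr's write path actually receives such broadcast views is an assumption.

```python
import numpy as np


def is_constant_view(arr):
    """True if all elements alias one stored value (zero stride on every axis longer than 1)."""
    return arr.size > 0 and all(
        stride == 0 for stride, length in zip(arr.strides, arr.shape) if length > 1
    )


def equals_fill(arr, fill_value):
    """Compare `arr` to a scalar, checking a single element when that is sufficient."""
    if is_constant_view(arr):
        # One element stands in for the whole array.
        return bool(arr[(0,) * arr.ndim] == fill_value)
    # Fall back to the full elementwise comparison.
    return bool(np.all(arr == fill_value))


empty = np.broadcast_to(np.float64(0.0), (64, 64))  # zero-stride view of one value
dense = np.zeros((64, 64))                          # ordinary allocated array

print(is_constant_view(empty), is_constant_view(dense))  # True False
print(equals_fill(empty, 0.0), equals_fill(dense, 0.0))  # True True
```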

@@ -331,6 +331,7 @@ async def write_batch(
        value: NDBuffer,
        drop_axes: tuple[int, ...] = (),
    ) -> None:
        write_empty_chunks = config.get("array.write_empty_chunks") == True  # noqa: E712
Member

I have some concerns about unpacking this config value so deep in the stack. I'd rather make this a property of the Array so that we can guarantee consistent write behavior after an Array has been initialized.

Contributor Author

I'd rather make this a property of the Array so that we can guarantee consistent write behavior after an Array has been initialized.

If we do that, then we also require that users create a brand new array if they want to write just some parts of the data with different empty-chunk handling (or we introduce write_empty_chunks as a mutable attribute, which I would rather avoid).
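The trade-off in this thread can be illustrated with a toy contrast between capturing the setting when the array is created and reading it on every write. Everything here is hypothetical (`settings`, `SnapshotArray`, `LateLookupArray`); it models the two designs under discussion, not zarr's classes.

```python
settings = {"write_empty_chunks": False}  # hypothetical global setting


class SnapshotArray:
    """Captures the setting once, at construction (the 'property of the Array' design)."""

    def __init__(self):
        self.write_empty_chunks = settings["write_empty_chunks"]

    def should_write(self, chunk_is_empty):
        return self.write_empty_chunks or not chunk_is_empty


class LateLookupArray:
    """Reads the setting on every write (the 'global runtime config' design)."""

    def should_write(self, chunk_is_empty):
        return settings["write_empty_chunks"] or not chunk_is_empty


snap, late = SnapshotArray(), LateLookupArray()
settings["write_empty_chunks"] = True  # changed after both arrays exist

print(snap.should_write(chunk_is_empty=True))  # False: behavior fixed at creation
print(late.should_write(chunk_is_empty=True))  # True: behavior follows the current setting
```

The snapshot design gives consistent behavior for the lifetime of the object, while the late-lookup design lets different writes to the same array use different handling.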

Successfully merging this pull request may close these issues.

[v3] support write_empty_chunks
5 participants